Memory Design
What survives outside the window — a taxonomy of memory types, when to write vs. when to recall, structured notes as durable planning state, and how memory invalidates
Why Context Window Is Not Memory
The context window is working memory, not storage. Three properties make it unsuitable as long-term memory:
- Bounded — no matter how large, it fills up. Any strategy that requires “just keep everything” breaks at scale.
- Lossy under compaction — summarization preserves gist, not detail. The specific thing you need three sessions later may be the first thing compaction drops.
- Per-session — a new conversation starts empty. Without an external mechanism, nothing learned in session A is available in session B.
A real memory system lives outside the window. Writes and reads are explicit operations the agent performs through tools. The window is the cache; memory is the disk.
This framing matters because it changes the design question. “What should the agent remember?” becomes:
- What is written to the external store?
- What triggers retrieval?
- How much is loaded at any given moment?
- How does stale memory get pruned?
Each question has different right answers depending on the type of memory.
A Memory Taxonomy
Cognitive science distinguishes several memory systems — episodic, semantic, procedural, working. The same taxonomy is a useful guide for agent design, because the four types have genuinely different read/write patterns.
| Type | What it holds | When it’s written | When it’s read | Example |
|---|---|---|---|---|
| Episodic | Events: what happened, when, with whom | After notable events (completions, errors) | When similar situations recur | “Last month’s deploy broke because of X” |
| Semantic | Facts, preferences, rules | When the user states or demonstrates them | When relevant to any decision | “User prefers Tailwind over CSS-in-JS” |
| Procedural | How to do something — skills, playbooks | After learning a reliable procedure | When the task matches a known procedure | “How this team writes commit messages” |
| Working | Current-task state | During the task | Throughout the task; discarded at completion | Open file list, current plan, pending todos |
Working memory belongs inside the context window (augmented by structured notes — see below). The other three types belong outside, in persistent storage.
This taxonomy is not a prescription that you must build four separate stores. It is a classification tool: when you have a piece of information to save, name what kind it is, then the write and recall strategies follow.
Why the distinction matters
- Episodic needs retrieval by similarity. “Did this situation happen before?” is a search problem, not an always-load problem. Keyword or vector retrieval fits.
- Semantic is cheaper to always load. User preferences are small and apply everywhere. Putting them in the system prompt skeleton often beats a retrieval round trip.
- Procedural is often skill-shaped. A playbook for “deploying to staging” wants to be retrieved when the user’s current task matches that procedure — not always.
- Working memory is disposable. It costs nothing to lose between tasks; trying to persist it creates a mess of stale state.
Confusing the types is a common failure mode. Example: putting episodic “this one PR had issue Y” into always-loaded semantic memory. Twenty PRs later, the system prompt is bloated with irrelevant war stories.
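One way to make the taxonomy operational is to tag each entry with its type so that write and recall policy can key off it. A minimal sketch, assuming a single store with a type tag per entry (all names here are illustrative, not a real API):

```python
# Tag memory entries by type so policy can key off the taxonomy.
from dataclasses import dataclass
from enum import Enum

class MemoryType(Enum):
    EPISODIC = "episodic"      # events: retrieved by similarity
    SEMANTIC = "semantic"      # facts/preferences: small, always loaded
    PROCEDURAL = "procedural"  # playbooks: retrieved on task match

@dataclass
class MemoryEntry:
    type: MemoryType
    title: str    # one-line hook shown in an eager index
    content: str  # full body, fetched lazily

def eager_load(entries: list[MemoryEntry]) -> list[MemoryEntry]:
    """Only semantic entries are cheap enough to always load in full."""
    return [e for e in entries if e.type is MemoryType.SEMANTIC]
```

Under this scheme the episodic war-story from the example above would carry `MemoryType.EPISODIC` and never reach the system prompt; only a similarity query would surface it.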
In-Context vs Persisted
The line between “keep in the window” and “externalize” is fuzzy. A practical rule:
Externalize anything that would be expensive to re-derive and inexpensive to retrieve.
Expensive to re-derive: decisions made after deliberation, user-stated preferences, project-specific conventions, facts about the environment discovered through tools.
Cheap to retrieve: small, stable, indexable by a keyword or path the agent already knows.
The fuzzy middle is conversation state mid-task — open files, current hypothesis, pending decisions. This is working memory. Two choices:
- Keep in window — simple, but vulnerable to compaction.
- Externalize as structured notes — durable, but adds friction.
For short tasks, keep in window. For long tasks, externalize early (see Structured Notes below).
Write Policy
When should the agent write to memory? Three triggers:
| Trigger | Signal | Write what |
|---|---|---|
| User says so | “Remember that…”, “From now on…”, “Next time…” | Near-verbatim, with the user’s own framing |
| Correction | “No, don’t do X”, “Actually, the right answer is Y” | The rule plus why — so edge cases can be judged later |
| Confirmed success | User accepts without pushback, especially a non-obvious choice | The approach + context (“in situation S, X worked”) |
The third trigger is the one most often missed. Saving only corrections teaches the agent what to avoid but not what to repeat — over time it becomes overly cautious, losing the instincts the user already validated. Save confirmed successes with equal weight.
What not to write:
- Facts derivable from current code or docs (`git log` is authoritative; don’t snapshot it)
- Ephemeral task state (current branch, open files)
- Content already in a CLAUDE.md-style project instruction file
- Anything the user said once in anger but hasn’t reaffirmed
Signal over coverage
When writing, optimize for signal density, not completeness. A memory entry is going to be read repeatedly over months. Every word matters. Good memory entries:
- Lead with the rule or fact
- Include why — the user’s stated or inferred reason
- Include how to apply — the situations where it kicks in
- Are short enough to read at a glance
Bad memory entries recount what happened in this conversation. Chronicles rot fast.
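For illustration, an entry that follows these rules (content invented for the example):

```
RULE: Use British English in all user-facing copy.
WHY:  The user corrected "color" to "colour" twice and asked to make it permanent.
WHEN: Any prose the user will read: docs, UI strings, commit messages.
```

A bad version of the same entry would read “Today we discussed spelling and agreed on some changes”, which retrieves nothing actionable.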
Recall Policy
Retrieval is where most memory systems quietly fail. Two common patterns:
Eager (always-loaded)
An index of memory lives in the system prompt. The agent sees “there are 12 memories, here are their titles” on every turn. Full content is fetched only when a memory looks relevant.
Good for: small, semantic-type memories. Stable preferences, project conventions, user identity.
Bad for: episodic trivia. A system prompt listing 200 past incidents is noise.
Lazy (on-demand)
The agent has a `recall_memory(query)` tool. It calls the tool when it thinks memory is relevant; the tool returns matching entries.
Good for: large stores, episodic memory, anything where relevance is case-by-case.
Bad for: memories the agent won’t know to ask about. If the user said “always use British English” two weeks ago and the agent never calls recall, it will write American English and feel fine about it.
Hybrid
The common sweet spot is index eagerly, content lazily: the agent always sees the index (titles + one-line hooks), and fetches full content on demand. This keeps the skeleton small while giving the agent the vocabulary to know what exists to retrieve.
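A sketch of the hybrid, assuming a plain dict as the store; the keys, titles, and bodies are invented for the example:

```python
# "Index eagerly, content lazily": the system prompt carries only titles +
# one-line keys; full content is pulled by a tool call.
STORE = {
    "british-english": ("Always use British English",
                        "User corrected spelling twice and asked to keep it."),
    "tailwind": ("User prefers Tailwind over CSS-in-JS",
                 "Stated during the landing-page task."),
}

def render_index() -> str:
    """Rendered into the system prompt skeleton on every turn."""
    lines = [f"- {key}: {title}" for key, (title, _) in STORE.items()]
    return "Memories available via recall_memory(key):\n" + "\n".join(lines)

def recall_memory(key: str) -> str:
    """The lazy half: the tool the agent calls to pull full content."""
    title, body = STORE[key]
    return f"{title}\n{body}"
```

The index gives the agent the vocabulary (“a memory about British English exists”) without paying for the body until it is actually needed.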
Recall budget
How many memories per retrieval is a tuning parameter. Returning 20 entries fills the context with noise; returning 1 misses cases where two memories both apply. A reasonable default is 3-5, ranked by relevance. If the store is large enough that ranking is hard, consider LLM-side-query (a small model decides which memories to fetch based on the user’s turn).
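One way to sketch the budget, using naive keyword overlap as a stand-in for real relevance ranking (illustrative only; production systems would use vector or LLM-side ranking):

```python
# Rank memories by word overlap with the query; return at most k, and never
# pad the context with zero-overlap entries.
def rank_memories(query: str, memories: dict[str, str], k: int = 4) -> list[str]:
    q = set(query.lower().split())
    scored = sorted(
        memories.items(),
        key=lambda kv: len(q & set(kv[1].lower().split())),
        reverse=True,
    )
    return [key for key, text in scored[:k] if q & set(text.lower().split())]
```

The default of 4 sits inside the 3–5 band; the zero-overlap filter means a query that matches nothing returns nothing, rather than the k least-irrelevant entries.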
The “doesn’t know to ask” failure
The subtlest recall failure is the one where the agent should have retrieved a memory but never called the tool. From the agent’s perspective, nothing went wrong — it answered confidently, based on defaults. From the user’s perspective, the agent ignored guidance it had been given. These failures are under-reported because neither party sees the missing memory.
Four mechanisms reduce “doesn’t know to ask” risk, in order of lightest to heaviest touch:
1. Always-loaded index. The memory index lives in the system prompt; the agent sees “there are memories named X, Y, Z” on every turn. It still has to choose to recall, but at least the vocabulary is present. Works best when the index is small (≤50 items).
2. Trigger cues in the system prompt. Name the situations in which recall is mandatory, not optional: “Before answering a question about project conventions, recall memories tagged `convention`.” This converts a judgment call into a rule.
3. Auto-prepend by task classification. A lightweight classifier tags each user turn with topic labels; memories whose titles match those labels are prepended to context before the agent sees the turn. Retrieval becomes involuntary — the agent can’t forget to call it, because it’s already done.
4. LLM-side-query routing. A cheaper, faster model reads the user’s turn and decides which memories to fetch; results are injected as context for the main model. Effective for large stores where hand-tuning the rules above becomes infeasible.
The common progression: start with (1) and (2). Add (3) only when manual recall misses are observed at a meaningful rate. Reach for (4) when the store outgrows what the main model can pick from reliably.
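Mechanism (3) can be sketched as a keyword classifier plus a label-to-memory map. The label rules and memory contents below are invented for the example:

```python
# Auto-prepend: tag the turn with topic labels, then inject matching memories
# into context before the main model ever sees the turn.
LABEL_RULES = {
    "deploy": ("deploy", "release", "ship"),
    "style": ("css", "tailwind", "styling"),
}

MEMORIES_BY_LABEL = {
    "deploy": ["Last month's deploy broke because of flag X"],
    "style": ["User prefers Tailwind over CSS-in-JS"],
}

def auto_prepend(user_turn: str) -> list[str]:
    """Memories to inject before the turn; empty list means inject nothing."""
    turn = user_turn.lower()
    labels = [lbl for lbl, kws in LABEL_RULES.items() if any(k in turn for k in kws)]
    return [m for lbl in labels for m in MEMORIES_BY_LABEL.get(lbl, [])]
```

Because the injection happens outside the agent’s control, the “doesn’t know to ask” failure is structurally impossible for any turn the classifier labels correctly; the risk moves into the classifier instead.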
Structured Notes
Working memory and structured notes solve the same problem at different scales:
- Working memory is the task state the agent holds inside the window while it works — the current plan, the last few files it touched, its pending todos. For short tasks, this is sufficient.
- Structured notes are working memory externalized to disk when the window can no longer hold it reliably — because the task is long, the context might compact, or the session might end.
The two are not different types of memory; they are the same type at different durabilities. Notes inherit the role of working memory the moment the agent can’t trust the window to still contain what it needs.
Structured note-taking is therefore a special case of memory: the agent writes semi-permanent state files during a task, then reads them back across context resets.
> The agent maintains a `NOTES.md` file it updates as it works. After a context reset, it reads the notes and continues.
>
> — Anthropic, *Effective Context Engineering for AI Agents*
Typical structured notes:
- Todo list — what’s done, what’s next, what’s blocked
- Progress journal — a running log of what was tried and what worked
- Architecture decisions — design choices made during the task, with rationale
- Open questions — things to ask the user at the next checkpoint
Why this works: files are lossless, always available, and don’t consume attention budget until read. The agent can carry arbitrarily long task state across context resets by externalizing it to disk.
Two design choices govern structured notes:
Immutability of completion. Once a todo is marked done, it should not silently revert. If the agent re-plans and the list appears to lose a done item, treat that as a bug — the item either stays done or is explicitly deleted. This prevents the “agent forgot it already did X and redid it” failure.
Write discipline. Notes are only useful if they stay current. The best convention is: write after each meaningful action, not at the end of the task. A mid-task context reset must find the notes correct, or they become worse than useless.
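The immutability rule can be enforced mechanically. A sketch, assuming the todo list is a mapping of item text to done-status (names illustrative):

```python
# Guard for "immutability of completion": a done todo may stay done or be
# explicitly deleted, but must never silently revert to not-done.
def update_todos(old: dict[str, bool], new: dict[str, bool]) -> dict[str, bool]:
    for item, done in old.items():
        if done and new.get(item) is False:
            raise ValueError(f"done item silently reverted: {item!r}")
    return new
```

Running every re-plan through a guard like this converts the “agent forgot it already did X and redid it” failure from a silent behavior change into a loud error.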
Structured notes are the single technique most often cited as unlocking truly long-horizon agents. Agents that write and read disk files routinely outperform agents that try to keep everything in the window — even when the window is theoretically large enough.
Invalidation
Memories go stale. Three mechanisms keep a memory store honest:
| Mechanism | When it applies | How |
|---|---|---|
| Verification | Memory references a specific file, flag, or function | Before acting on it, grep / check that the thing still exists |
| Correction | Recalled memory contradicts current observation | Trust the observation; update or delete the memory |
| Expiration | Memory is time-bound (deadlines, sprint state) | Convert relative dates to absolute when writing; decay after expiry |
The common failure is acting on stale memory without verification. Example: memory says “use the `createFoo()` helper”, but a refactor renamed it to `makeFoo()`. If the agent retrieves the memory and acts on it without checking, it writes broken code and adds the original memory to a mental “trusted” list — compounding the error.
A good default: memory entries that name a specific identifier are a claim that the identifier existed when written. Before acting on them, verify.
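A sketch of that default, assuming memories name identifiers in `name()` form and the codebase is available as path-to-text pairs (a real agent would grep the working tree; the regex and names are illustrative):

```python
import re

def identifiers_in(memory_text: str) -> list[str]:
    """Pull code-like identifiers, e.g. createFoo(), out of a memory entry."""
    return re.findall(r"\b(\w+)\(\)", memory_text)

def stale_identifiers(memory_text: str, source_files: dict[str, str]) -> list[str]:
    """Identifiers the memory names that no longer appear anywhere.
    Non-empty result means: verify or update the memory before acting."""
    corpus = "\n".join(source_files.values())
    return [ident for ident in identifiers_in(memory_text) if ident not in corpus]
```

In the `createFoo()` example above, the check returns the stale name instead of letting the agent write broken code on trust.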
Anti-Patterns
Authoring mistakes — common ways memory design goes wrong at design time. For runtime behavior that fails despite good design, see Context Management → Failure Modes.
| Anti-pattern | Why it hurts |
|---|---|
| Memory as transcript | Writing “we talked about X today” produces zero retrieval value. Extract the insight. |
| Only writing corrections | Agent grows overly cautious. Also save confirmed-successful approaches. |
| Unbounded store | Without pruning, the index becomes too long to be useful. Cap and rotate. |
| Writing project state that code has | git log, file contents, and architecture diagrams are authoritative. Don’t duplicate. |
| Memory without why | A rule without reason can’t be judged against edge cases. Always include the motivation. |
| Vague recall queries | “Relevant memories” as a query returns noise. Prefer specific keywords tied to the task. |
Composing With the Other Pillars
Memory does not work alone:
-
Prompt design decides how the memory index is rendered in the skeleton, and with what description the recall tool is advertised to the model. A recall tool described as “fetch old notes” gets called less than one described as “retrieve prior decisions, preferences, and rules for this user and project”.
-
Context management decides how retrieved memories are kept or pruned as the conversation progresses. A retrieved memory that becomes irrelevant should age out with other tool results, not permanently inflate the window. Compaction strategies must account for memory entries — preserve critical ones, drop forgotten ones.
Memory is often the first pillar an agent outgrows. A well-designed memory system turns a one-shot assistant into an agent that genuinely learns over time — and until you have it, no amount of prompt tuning will get you there.
Measuring It
Memory is the hardest pillar to evaluate because its benefits are delayed. A correct memory write pays off three weeks later, when recalled. Measure along these axes:
- Write correctness — for a sample of writes, is the entry accurate, non-duplicative, appropriately scoped? Check a manual sample; at scale, automated pattern checks can stand in.
- Recall precision / recall — did the agent retrieve the memories that were actually relevant to the turn? Measure over a synthetic test set where the ground truth is known.
- Recall omission — the failure mode of “the agent should have recalled but didn’t”. Harder to measure because it’s a silent failure. Proxies: regression on tasks that worked with prior memory present.
- Staleness rate — what fraction of recalled memories references things that no longer exist? If over ~10%, invalidation discipline is slipping.
- Store growth — a memory store that only grows is a liability. Track new writes per week and prunes per week.
A practical eval: keep a synthetic “regression conversation” — a scripted multi-turn dialogue that relies on memories written in earlier turns. Running it periodically surfaces memory regressions the user would otherwise hit in the wild.
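A sketch of such a harness, assuming the agent is any callable of `(turn, memories)` that may append to the shared memory list; the toy agent and script below are entirely illustrative:

```python
# Replay a scripted dialogue where later turns depend on memories written in
# earlier turns; return the turns whose replies missed the expected content.
def run_regression(agent, script):
    """script: list of (user_turn, must_contain) pairs; must_contain=None
    means the turn only exists to write memory, not to check a reply."""
    memories: list[str] = []
    failures = []
    for turn, must_contain in script:
        reply = agent(turn, memories)
        if must_contain and must_contain.lower() not in reply.lower():
            failures.append((turn, must_contain))
    return failures

def toy_agent(turn: str, memories: list[str]) -> str:
    """Toy stand-in: writes on 'remember...', answers from memory otherwise."""
    if turn.lower().startswith("remember"):
        memories.append(turn)
        return "noted"
    return " ".join(memories) or "no idea"
```

An empty failure list means the memory written in turn one still influences turn two; a regression in the write or recall path shows up as a named failing turn.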
Related Reading
- Prompt Design — The memory index and recall tool’s description live inside the system prompt. How they’re framed there determines whether the agent uses memory at all.
- Context Management — Controls how retrieved memories age out alongside other context. A retrieved memory that becomes irrelevant should be pruned, not permanently inflate the window.
Sources
- Effective Context Engineering for AI Agents — Anthropic, 2025
- MemGPT: Towards LLMs as Operating Systems — Packer et al., 2023, for the memory hierarchy framing
- Generative Agents: Interactive Simulacra of Human Behavior — Park et al., 2023, for episodic / semantic memory in agent contexts