Memory Design
What survives outside the window — a taxonomy of memory types, when to write vs. when to recall, structured notes as durable planning state, and how memory invalidates
Why Context Window Is Not Memory
The context window is working memory, not storage. Three properties make it unsuitable as long-term memory:
- Bounded — no matter how large, it fills up. Any strategy that requires “just keep everything” breaks at scale.
- Lossy under compaction — summarization preserves gist, not detail. The specific thing you need three sessions later may be the first thing compaction drops.
- Per-session — a new conversation starts empty. Without an external mechanism, nothing learned in session A is available in session B.
A real memory system lives outside the window. Writes and reads are explicit operations the agent performs through tools. The window is the cache; memory is the disk.
This framing matters because it changes the design question. “What should the agent remember?” becomes:
- What is written to the external store?
- What triggers retrieval?
- How much is loaded at any given moment?
- How does stale memory get pruned?
Each question has different right answers depending on the type of memory.
A Memory Taxonomy
Cognitive science distinguishes several memory systems — episodic, semantic, procedural, working. The same taxonomy is a useful guide for agent design, because the four types have genuinely different read/write patterns.
| Type | What it holds | When it’s written | When it’s read | Example |
|---|---|---|---|---|
| Episodic | Events: what happened, when, with whom | After notable events (completions, errors) | When similar situations recur | “Last month’s deploy broke because of X” |
| Semantic | Facts, preferences, rules | When the user states or demonstrates them | When relevant to any decision | “User prefers Tailwind over CSS-in-JS” |
| Procedural | How to do something — skills, playbooks | After learning a reliable procedure | When the task matches a known procedure | “How this team writes commit messages” |
| Working | Current-task state | During the task | Throughout the task; discarded at completion | Open file list, current plan, pending todos |
Working memory belongs inside the context window (augmented by structured notes — see below). The other three types belong outside, in persistent storage.
This taxonomy is not a prescription that you must build four separate stores. It is a classification tool: when you have a piece of information to save, name what kind it is, then the write and recall strategies follow.
Why the distinction matters
- Episodic needs retrieval by similarity. “Did this situation happen before?” is a search problem, not an always-load problem. Keyword or vector retrieval fits.
- Semantic is cheaper to always load. User preferences are small and apply everywhere. Putting them in the system prompt skeleton often beats a retrieval round trip.
- Procedural is often skill-shaped. A playbook for “deploying to staging” wants to be retrieved when the user’s current task matches that procedure — not always.
- Working memory is disposable. It costs nothing to lose between tasks; trying to persist it creates a mess of stale state.
Confusing the types is a common failure mode. Example: putting episodic “this one PR had issue Y” into always-loaded semantic memory. Twenty PRs later, the system prompt is bloated with irrelevant war stories.
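One way to make the taxonomy operational is to tag each entry with its type so that write and recall policy can key off it. A minimal sketch, assuming a single store with a type tag per entry (all names here are illustrative, not a real API):

```python
# Tag memory entries by type so policy can key off the taxonomy.
from dataclasses import dataclass
from enum import Enum

class MemoryType(Enum):
    EPISODIC = "episodic"      # events: retrieved by similarity
    SEMANTIC = "semantic"      # facts/preferences: small, always loaded
    PROCEDURAL = "procedural"  # playbooks: retrieved on task match

@dataclass
class MemoryEntry:
    type: MemoryType
    title: str    # one-line hook shown in an eager index
    content: str  # full body, fetched lazily

def eager_load(entries: list[MemoryEntry]) -> list[MemoryEntry]:
    """Only semantic entries are cheap enough to always load in full."""
    return [e for e in entries if e.type is MemoryType.SEMANTIC]
```

Under this scheme the episodic war-story from the example above would carry `MemoryType.EPISODIC` and never reach the system prompt; only a similarity query would surface it.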
In-Context vs Persisted
The line between “keep in the window” and “externalize” is fuzzy. A practical rule:
Externalize anything that would be expensive to re-derive and inexpensive to retrieve.
Expensive to re-derive: decisions made after deliberation, user-stated preferences, project-specific conventions, facts about the environment discovered through tools.
Cheap to retrieve: small, stable, indexable by a keyword or path the agent already knows.
The fuzzy middle is conversation state mid-task — open files, current hypothesis, pending decisions. This is working memory. Two choices:
- Keep in window — simple, but vulnerable to compaction.
- Externalize as structured notes — durable, but adds friction.
For short tasks, keep in window. For long tasks, externalize early (see Structured Notes below).
Write Policy
When should the agent write to memory? Three triggers:
| Trigger | Signal | Write what |
|---|---|---|
| User says so | “Remember that…”, “From now on…”, “Next time…” | Near-verbatim, with the user’s own framing |
| Correction | “No, don’t do X”, “Actually, the right answer is Y” | The rule plus why — so edge cases can be judged later |
| Confirmed success | User accepts without pushback, especially a non-obvious choice | The approach + context (“in situation S, X worked”) |
The third trigger is the one most often missed. Saving only corrections teaches the agent what to avoid but not what to repeat — over time it becomes overly cautious, losing the instincts the user already validated. Save confirmed successes with equal weight.
What not to write:
- Facts derivable from current code or docs (`git log` is authoritative; don’t snapshot it)
- Ephemeral task state (current branch, open files)
- Content already in a CLAUDE.md-style project instruction file
- Anything the user said once in anger but hasn’t reaffirmed
Signal over coverage
When writing, optimize for signal density, not completeness. A memory entry is going to be read repeatedly over months. Every word matters. Good memory entries:
- Lead with the rule or fact
- Include why — the user’s stated or inferred reason
- Include how to apply — the situations where it kicks in
- Are short enough to read at a glance
Bad memory entries recount what happened in this conversation. Chronicles rot fast.
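For illustration, an entry that follows these rules (content invented for the example):

```
RULE: Use British English in all user-facing copy.
WHY:  The user corrected "color" to "colour" twice and asked to make it permanent.
WHEN: Any prose the user will read: docs, UI strings, commit messages.
```

A bad version of the same entry would read “Today we discussed spelling and agreed on some changes”, which retrieves nothing actionable.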
Recall Policy
Retrieval is where most memory systems quietly fail. Two common patterns:
Eager (always-loaded)
An index of memory lives in the system prompt. The agent sees “there are 12 memories, here are their titles” on every turn. Full content is fetched only when a memory looks relevant.
Good for: small, semantic-type memories. Stable preferences, project conventions, user identity.
Bad for: episodic trivia. A system prompt listing 200 past incidents is noise.
Lazy (on-demand)
The agent has a `recall_memory(query)` tool. It calls the tool when it thinks memory is relevant; the tool returns matching entries.
Good for: large stores, episodic memory, anything where relevance is case-by-case.
Bad for: memories the agent won’t know to ask about. If the user said “always use British English” two weeks ago and the agent never calls recall, it will write American English and feel fine about it.
Hybrid
The common sweet spot is index eagerly, content lazily: the agent always sees the index (titles + one-line hooks), and fetches full content on demand. This keeps the skeleton small while giving the agent the vocabulary to know what exists to retrieve.
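A sketch of the hybrid, assuming a plain dict as the store; the keys, titles, and bodies are invented for the example:

```python
# "Index eagerly, content lazily": the system prompt carries only titles +
# one-line keys; full content is pulled by a tool call.
STORE = {
    "british-english": ("Always use British English",
                        "User corrected spelling twice and asked to keep it."),
    "tailwind": ("User prefers Tailwind over CSS-in-JS",
                 "Stated during the landing-page task."),
}

def render_index() -> str:
    """Rendered into the system prompt skeleton on every turn."""
    lines = [f"- {key}: {title}" for key, (title, _) in STORE.items()]
    return "Memories available via recall_memory(key):\n" + "\n".join(lines)

def recall_memory(key: str) -> str:
    """The lazy half: the tool the agent calls to pull full content."""
    title, body = STORE[key]
    return f"{title}\n{body}"
```

The index gives the agent the vocabulary (“a memory about British English exists”) without paying for the body until it is actually needed.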
Recall budget
How many memories per retrieval is a tuning parameter. Returning 20 entries fills the context with noise; returning 1 misses cases where two memories both apply. A reasonable default is 3-5, ranked by relevance. If the store is large enough that ranking is hard, consider LLM-side-query (a small model decides which memories to fetch based on the user’s turn).
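One way to sketch the budget, using naive keyword overlap as a stand-in for real relevance ranking (illustrative only; production systems would use vector or LLM-side ranking):

```python
# Rank memories by word overlap with the query; return at most k, and never
# pad the context with zero-overlap entries.
def rank_memories(query: str, memories: dict[str, str], k: int = 4) -> list[str]:
    q = set(query.lower().split())
    scored = sorted(
        memories.items(),
        key=lambda kv: len(q & set(kv[1].lower().split())),
        reverse=True,
    )
    return [key for key, text in scored[:k] if q & set(text.lower().split())]
```

The default of 4 sits inside the 3–5 band; the zero-overlap filter means a query that matches nothing returns nothing, rather than the k least-irrelevant entries.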
The “doesn’t know to ask” failure
The subtlest recall failure is the one where the agent should have retrieved a memory but never called the tool. From the agent’s perspective, nothing went wrong — it answered confidently, based on defaults. From the user’s perspective, the agent ignored guidance it had been given. These failures are under-reported because neither party sees the missing memory.
Four mechanisms reduce “doesn’t know to ask” risk, in order of lightest to heaviest touch:
1. Always-loaded index. The memory index lives in the system prompt; the agent sees “there are memories named X, Y, Z” on every turn. It still has to choose to recall, but at least the vocabulary is present. Works best when the index is small (≤50 items).
2. Trigger cues in the system prompt. Name the situations in which recall is mandatory, not optional: “Before answering a question about project conventions, recall memories tagged `convention`.” This converts a judgment call into a rule.
3. Auto-prepend by task classification. A lightweight classifier tags each user turn with topic labels; memories whose titles match those labels are prepended to context before the agent sees the turn. Retrieval becomes involuntary — the agent can’t forget to call it, because it’s already done.
4. LLM-side-query routing. A cheaper, faster model reads the user’s turn and decides which memories to fetch; results are injected as context for the main model. Effective for large stores where hand-tuning the rules above becomes infeasible.
The common progression: start with (1) and (2). Add (3) only when manual recall misses are observed at a meaningful rate. Reach for (4) when the store outgrows what the main model can pick from reliably.
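Mechanism (3) can be sketched as a keyword classifier plus a label-to-memory map. The label rules and memory contents below are invented for the example:

```python
# Auto-prepend: tag the turn with topic labels, then inject matching memories
# into context before the main model ever sees the turn.
LABEL_RULES = {
    "deploy": ("deploy", "release", "ship"),
    "style": ("css", "tailwind", "styling"),
}

MEMORIES_BY_LABEL = {
    "deploy": ["Last month's deploy broke because of flag X"],
    "style": ["User prefers Tailwind over CSS-in-JS"],
}

def auto_prepend(user_turn: str) -> list[str]:
    """Memories to inject before the turn; empty list means inject nothing."""
    turn = user_turn.lower()
    labels = [lbl for lbl, kws in LABEL_RULES.items() if any(k in turn for k in kws)]
    return [m for lbl in labels for m in MEMORIES_BY_LABEL.get(lbl, [])]
```

Because the injection happens outside the agent’s control, the “doesn’t know to ask” failure is structurally impossible for any turn the classifier labels correctly; the risk moves into the classifier instead.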
Structured Notes
Working memory and structured notes solve the same problem at different scales:
- Working memory is the task state the agent holds inside the window while it works — the current plan, the last few files it touched, its pending todos. For short tasks, this is sufficient.
- Structured notes are working memory externalized to disk when the window can no longer hold it reliably — because the task is long, the context might compact, or the session might end.
The two are not different types of memory; they are the same type at different durabilities. Notes inherit the role of working memory the moment the agent can’t trust the window to still contain what it needs.
Structured note-taking is therefore a special case of memory: the agent writes semi-permanent state files during a task, then reads them back across context resets.
> The agent maintains a `NOTES.md` file it updates as it works. After a context reset, it reads the notes and continues.
>
> — Anthropic, *Effective Context Engineering for AI Agents*
Typical structured notes:
- Todo list — what’s done, what’s next, what’s blocked
- Progress journal — a running log of what was tried and what worked
- Architecture decisions — design choices made during the task, with rationale
- Open questions — things to ask the user at the next checkpoint
Why this works: files are lossless, always available, and don’t consume attention budget until read. The agent can carry arbitrarily long task state across context resets by externalizing it to disk.
Two design choices govern structured notes:
Immutability of completion. Once a todo is marked done, it should not silently revert. If the agent re-plans and the list appears to lose a done item, treat that as a bug — the item either stays done or is explicitly deleted. This prevents the “agent forgot it already did X and redid it” failure.
Write discipline. Notes are only useful if they stay current. The best convention is: write after each meaningful action, not at the end of the task. A mid-task context reset must find the notes correct, or they become worse than useless.
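The immutability rule can be enforced mechanically. A sketch, assuming the todo list is a mapping of item text to done-status (names illustrative):

```python
# Guard for "immutability of completion": a done todo may stay done or be
# explicitly deleted, but must never silently revert to not-done.
def update_todos(old: dict[str, bool], new: dict[str, bool]) -> dict[str, bool]:
    for item, done in old.items():
        if done and new.get(item) is False:
            raise ValueError(f"done item silently reverted: {item!r}")
    return new
```

Running every re-plan through a guard like this converts the “agent forgot it already did X and redid it” failure from a silent behavior change into a loud error.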
Structured notes are the single technique most often cited as unlocking truly long-horizon agents. Agents that write and read disk files routinely outperform agents that try to keep everything in the window — even when the window is theoretically large enough.
Invalidation
Memories go stale. Three mechanisms keep a memory store honest:
| Mechanism | When it applies | How |
|---|---|---|
| Verification | Memory references a specific file, flag, or function | Before acting on it, grep / check that the thing still exists |
| Correction | Recalled memory contradicts current observation | Trust the observation; update or delete the memory |
| Expiration | Memory is time-bound (deadlines, sprint state) | Convert relative dates to absolute when writing; decay after expiry |
The common failure is acting on stale memory without verification. Example: memory says “use the `createFoo()` helper”, but a refactor renamed it to `makeFoo()`. If the agent retrieves the memory and acts on it without checking, it writes broken code and adds the original memory to a mental “trusted” list — compounding the error.
A good default: memory entries that name a specific identifier are a claim that the identifier existed when written. Before acting on them, verify.
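A sketch of that default, assuming memories name identifiers in `name()` form and the codebase is available as path-to-text pairs (a real agent would grep the working tree; the regex and names are illustrative):

```python
import re

def identifiers_in(memory_text: str) -> list[str]:
    """Pull code-like identifiers, e.g. createFoo(), out of a memory entry."""
    return re.findall(r"\b(\w+)\(\)", memory_text)

def stale_identifiers(memory_text: str, source_files: dict[str, str]) -> list[str]:
    """Identifiers the memory names that no longer appear anywhere.
    Non-empty result means: verify or update the memory before acting."""
    corpus = "\n".join(source_files.values())
    return [ident for ident in identifiers_in(memory_text) if ident not in corpus]
```

In the `createFoo()` example above, the check returns the stale name instead of letting the agent write broken code on trust.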
Anti-Patterns
Authoring mistakes — common ways memory design goes wrong at design time. For runtime behavior that fails despite good design, see Context Management → Failure Modes.
| Anti-pattern | Why it hurts |
|---|---|
| Memory as transcript | Writing “we talked about X today” produces zero retrieval value. Extract the insight. |
| Only writing corrections | Agent grows overly cautious. Also save confirmed-successful approaches. |
| Unbounded store | Without pruning, the index becomes too long to be useful. Cap and rotate. |
| Writing project state that code has | git log, file contents, and architecture diagrams are authoritative. Don’t duplicate. |
| Memory without why | A rule without reason can’t be judged against edge cases. Always include the motivation. |
| Vague recall queries | “Relevant memories” as a query returns noise. Prefer specific keywords tied to the task. |
Composing With the Other Pillars
Memory does not work alone:
-
Prompt design decides how the memory index is rendered in the skeleton, and with what description the recall tool is advertised to the model. A recall tool described as “fetch old notes” gets called less than one described as “retrieve prior decisions, preferences, and rules for this user and project”.
-
Context management decides how retrieved memories are kept or pruned as the conversation progresses. A retrieved memory that becomes irrelevant should age out with other tool results, not permanently inflate the window. Compaction strategies must account for memory entries — preserve critical ones, drop forgotten ones.
Memory is often the first pillar an agent outgrows. A well-designed memory system turns a one-shot assistant into an agent that genuinely learns over time — and until you have it, no amount of prompt tuning will get you there.
Measuring It
Memory is the hardest pillar to evaluate because its benefits are delayed. A correct memory write pays off three weeks later, when recalled. Measure along these axes:
- Write correctness — for a sample of writes, is the entry accurate, non-duplicative, appropriately scoped? Check a manual sample; at scale, automated pattern checks can stand in.
- Recall precision / recall — did the agent retrieve the memories that were actually relevant to the turn? Measure over a synthetic test set where the ground truth is known.
- Recall omission — the failure mode of “the agent should have recalled but didn’t”. Harder to measure because it’s a silent failure. Proxies: regression on tasks that worked with prior memory present.
- Staleness rate — what fraction of recalled memories references things that no longer exist? If over ~10%, invalidation discipline is slipping.
- Store growth — a memory store that only grows is a liability. Track new writes per week and prunes per week.
A practical eval: keep a synthetic “regression conversation” — a scripted multi-turn dialogue that relies on memories written in earlier turns. Running it periodically surfaces memory regressions the user would otherwise hit in the wild.
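A sketch of such a harness, assuming the agent is any callable of `(turn, memories)` that may append to the shared memory list; the toy agent and script below are entirely illustrative:

```python
# Replay a scripted dialogue where later turns depend on memories written in
# earlier turns; return the turns whose replies missed the expected content.
def run_regression(agent, script):
    """script: list of (user_turn, must_contain) pairs; must_contain=None
    means the turn only exists to write memory, not to check a reply."""
    memories: list[str] = []
    failures = []
    for turn, must_contain in script:
        reply = agent(turn, memories)
        if must_contain and must_contain.lower() not in reply.lower():
            failures.append((turn, must_contain))
    return failures

def toy_agent(turn: str, memories: list[str]) -> str:
    """Toy stand-in: writes on 'remember...', answers from memory otherwise."""
    if turn.lower().startswith("remember"):
        memories.append(turn)
        return "noted"
    return " ".join(memories) or "no idea"
```

An empty failure list means the memory written in turn one still influences turn two; a regression in the write or recall path shows up as a named failing turn.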
Related Reading
- Prompt Design — The memory index and recall tool’s description live inside the system prompt. How they’re framed there determines whether the agent uses memory at all.
- Context Management — Controls how retrieved memories age out alongside other context. A retrieved memory that becomes irrelevant should be pruned, not permanently inflate the window.
Sources
- Effective Context Engineering for AI Agents — Anthropic, 2025
- MemGPT: Towards LLMs as Operating Systems — Packer et al., 2023, for the memory hierarchy framing
- Generative Agents: Interactive Simulacra of Human Behavior — Park et al., 2023, for episodic / semantic memory in agent contexts