Memory system design

You've got five memory types: short-term session context, the user profile, long-term facts (vector-searchable), episodic, and procedural. A real agent uses all of them, layered. The art is picking what to pull in when, and what to write out, so the agent has the right context at the right moment without drowning in its own history. This page is the architecture.

The layered stack

Most production agents don't need all five. The minimum useful system has short-term + user profile. Add vector and episodic when you want "remembers past conversations." Add procedures when you want the agent to get better at recurring tasks.
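That minimum useful system can be sketched in a few lines: a bounded short-term turn buffer plus a small always-injected user profile. Everything here (class name, window size, context format) is illustrative, not a prescribed design:

```python
from dataclasses import dataclass, field

# Sketch of the minimum useful system: a bounded short-term turn buffer
# plus a small always-injected user profile. Names are illustrative.
@dataclass
class MinimalMemory:
    profile: dict = field(default_factory=dict)   # always loaded, every turn
    turns: list = field(default_factory=list)     # short-term session context
    max_turns: int = 20

    def add_turn(self, role: str, text: str) -> None:
        self.turns.append((role, text))
        self.turns = self.turns[-self.max_turns:]  # keep the window bounded

    def context(self) -> str:
        profile_block = "\n".join(f"{k}: {v}" for k, v in self.profile.items())
        turn_block = "\n".join(f"{r}: {t}" for r, t in self.turns)
        return f"[profile]\n{profile_block}\n[session]\n{turn_block}"
```

Vector, episodic, and procedural layers bolt on later without changing this core shape.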

The retrieval stack: what gets pulled in when

  1. Always (every turn): user profile. Small, cheap, always relevant.
  2. On session start: recent episodes matching the current task.
  3. On demand: agent calls recall(topic) when it needs older context.
  4. Automatic vector pull: for chat-heavy assistants, vector-search the current turn and inject top-k matches.
  5. Procedure match: at task start, check if a known procedure matches.

Each retrieval path is a separate decision. Retrieve aggressively on every path and the agent gets slow, expensive, and noisy; retrieve too conservatively and it feels amnesiac. Tune per retrieval path, not as one dial.
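The stack above can be sketched as one assembly function in which each path is its own gate. The store layout, the keyword match standing in for vector search, and every name here are illustrative assumptions, not a real framework:

```python
# Sketch of the retrieval stack as separate decisions. The store layout,
# the keyword match standing in for vector search, and all names are
# illustrative assumptions, not a real framework.
def build_context(turn_no: int, user_query: str, store: dict) -> list:
    parts = [store["profile"]]                      # 1. always: user profile
    if turn_no == 0:
        parts += store["recent_episodes"][:2]       # 2. session start: recent episodes
    # 3. on-demand recall is the model's own tool call, so it is not wired here
    words = user_query.lower().split()
    hits = [m for m in store["vector"] if any(w in m.lower() for w in words)]
    parts += hits[:3]                               # 4. automatic vector pull, top-k
    if turn_no == 0:
        proc = store["procedures"].get(words[0])
        if proc:
            parts.append(proc)                      # 5. procedure match at task start
    return parts
```

Because each path is a separate branch, each can be tuned (or disabled) independently, which is the "not one dial" point.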

Writes: deliberate, not automatic

Writing is where memory systems rot if you're not careful. Writes should be deliberate: something decides a fact is worth keeping, rather than every turn getting logged into long-term storage by default.
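One workable default is to route every write through a single explicit call that refuses near-duplicates. This gate is a hypothetical sketch, not a prescribed implementation:

```python
# Hypothetical write gate: all writes go through one explicit call, and
# near-duplicate facts are refused rather than appended again.
def remember(store: list, fact: str) -> bool:
    normalized = fact.strip().lower()
    for existing in store:
        if existing.strip().lower() == normalized:
            return False          # duplicate: refuse the write
    store.append(fact.strip())
    return True                   # genuinely new: keep it
```

A real system would use fuzzy or embedding similarity instead of exact normalized equality, but the shape is the same: writes are a decision point, not a side effect.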

Expose memory as tools

The cleanest way to give the agent access to memory is to expose it as tools the model can call:

remember(fact)              # write to long-term facts
recall(topic)               # semantic search over memory
list_episodes(user, days)   # browse recent sessions
get_procedure(name)         # load a procedure
forget(fact_id)             # delete on user request

Now the model can decide when to reach for memory, same as it decides when to reach for a search tool. This keeps the loop clean and traceable.
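A toy version of that dispatch, with an in-memory dict standing in for real storage and a substring match standing in for semantic search; the tool names mirror the list above, but nothing here is a real framework's API:

```python
import uuid

# Toy dispatch table: memory operations registered as named tools the
# model can call. The dict store and the substring "search" are
# stand-ins for a real database and a real embedding index.
FACTS: dict = {}

def remember(fact: str) -> str:
    fact_id = str(uuid.uuid4())
    FACTS[fact_id] = fact
    return fact_id                 # the id lets the model forget() later

def recall(topic: str) -> list:
    return [f for f in FACTS.values() if topic.lower() in f.lower()]

def forget(fact_id: str) -> bool:
    return FACTS.pop(fact_id, None) is not None

TOOLS = {"remember": remember, "recall": recall, "forget": forget}

def call_tool(name: str, **kwargs):
    return TOOLS[name](**kwargs)   # same shape as any other tool call
```

Because memory calls go through the same dispatch as every other tool, they show up in the same trace, which is what makes the loop auditable.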

A worked example: a coding assistant, all five layers

User: "Fix the failing tests in the billing module."

  1. Session context: the current conversation. Empty, first turn.
  2. User profile loaded: this user uses Python 3.12, prefers pytest, no JS background.
  3. Episode match: retrieved a session from 3 weeks ago where we fixed a different bug in the billing module. Context about that module's quirks is available.
  4. Procedure match: fix_failing_tests procedure retrieved. Steps: run the tests, read the failure, narrow to the specific assertion, read surrounding code, propose fix, re-run.
  5. Vector memory: on a specific turn, agent calls recall("billing timezone bug") and finds a note from the past saying the team standardized on UTC.

Five different memory layers, each contributing something the agent couldn't have figured out from scratch. The task that would've taken 20 tool calls takes 8.

Expiry and updates

Memory that never decays becomes wrong memory. Timestamp facts at write time, let a newer write on the same key supersede the old value rather than coexist with it, and expire what goes stale.
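One way to sketch this, assuming per-fact timestamps, a TTL applied at read time, and last-write-wins on conflicting keys (all hypothetical choices, not the only decay policy):

```python
import time
from typing import Optional

# Hypothetical decay rules: every fact carries a written-at timestamp,
# reads filter out anything older than a TTL, and a rewrite of the same
# key supersedes the old value instead of coexisting with it.
class DecayingFacts:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._facts = {}  # key -> (value, written_at)

    def write(self, key: str, value: str) -> None:
        self._facts[key] = (value, time.time())   # last write wins

    def read(self, key: str, now: Optional[float] = None) -> Optional[str]:
        entry = self._facts.get(key)
        if entry is None:
            return None
        value, written_at = entry
        now = time.time() if now is None else now
        return None if now - written_at > self.ttl else value  # expired -> None
```

Filtering at read time (rather than running a deletion job) keeps the policy in one place, at the cost of dead rows that a periodic sweep can clean up.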

Isolation: never leak across users

Every memory operation must be scoped to a user ID. One missing filter somewhere and user A's private facts leak into user B's session. Build a memory layer API that takes user_id as a required arg everywhere; never let a raw DB query into your retrieval path.
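A sketch of that API shape: every method demands a user_id, and the toy storage is partitioned per user, so leaking across users would take an explicit bug rather than an omitted filter. Class and method names are illustrative:

```python
from collections import defaultdict

# Sketch of the scoping rule: user_id is a required argument on every
# operation and storage is partitioned per user, so cross-user leakage
# would require an explicit bug, not a forgotten WHERE clause.
class ScopedMemory:
    def __init__(self):
        self._by_user = defaultdict(list)

    def write(self, user_id: str, fact: str) -> None:
        if not user_id:
            raise ValueError("user_id is required")
        self._by_user[user_id].append(fact)

    def search(self, user_id: str, query: str) -> list:
        if not user_id:
            raise ValueError("user_id is required")
        return [f for f in self._by_user[user_id] if query.lower() in f.lower()]
```

The point is structural: callers never see the whole store, only the slice for the user they named.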

Observability

Track: what memory was loaded per session, how much context it consumed, how often each retrieval path produced something the agent actually used. Unused retrievals are cost you can cut.
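A minimal sketch of per-path counters, assuming the agent loop can report whether a retrieval was actually used (how "used" is detected is left open here):

```python
from collections import defaultdict

# Hypothetical per-path counters: how often each retrieval path fired,
# how many tokens it injected, and how often the result was used.
class RetrievalStats:
    def __init__(self):
        self.loads = defaultdict(int)
        self.tokens = defaultdict(int)
        self.used = defaultdict(int)

    def record(self, path: str, token_count: int, was_used: bool) -> None:
        self.loads[path] += 1
        self.tokens[path] += token_count
        if was_used:
            self.used[path] += 1

    def waste_ratio(self, path: str) -> float:
        # fraction of this path's retrievals the agent never used
        if self.loads[path] == 0:
            return 0.0
        return 1.0 - self.used[path] / self.loads[path]
```

A path with a high waste ratio and a high token count is the first retrieval dial to turn down.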

Pitfalls

What to do with this