Short-term memory

Short-term memory is everything that's in the model's context window right now. Not some exotic abstraction. Just: the system prompt, the user's question, the conversation so far, the tool calls made, the results that came back. That's what the model reads every time it takes a step. If you don't manage it, long sessions get slow, expensive, and, worst, wrong. The model starts forgetting what it was doing.

What's actually in short-term memory

Every LLM call includes all of the above. After 15 turns of tool use, that's 30+ blocks of content piling up. Which is where the problems start.
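The pile-up is easy to see in miniature. A minimal sketch, assuming a generic chat-message list (the schema is illustrative, not any specific SDK's):

```python
# A rough sketch of the blocks a single LLM call carries after a long
# tool-use session. The message schema is generic, not any specific SDK's.
messages = [
    {"role": "system", "content": "You are a research agent."},
    {"role": "user", "content": "Compare approach X with approach Y."},
]

# Each tool-use turn appends two blocks: the assistant's tool call
# and the result that came back.
for turn in range(15):
    messages.append({"role": "assistant", "content": f"<tool call {turn}>"})
    messages.append({"role": "tool", "content": f"<tool result {turn}>"})

print(len(messages))  # 32 content blocks after 15 turns
```

Every one of those 32 blocks is re-sent and re-read on every subsequent call.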

The long-session collapse

Left unmanaged, short-term memory explodes:

  - Latency climbs, because the model re-reads the entire history on every call.
  - Cost climbs the same way: you pay for every accumulated token, every turn.
  - Accuracy drops: the original instructions get buried mid-context, and the model starts losing the thread.

The cure is aggressive management: trim, summarize, or retrieve. Not "just use a longer context window."

Four strategies for keeping it lean

Keep in full, trim the rest

The always-keep list, regardless of session length:

  - The system prompt and the user's original request: they define the task.
  - The most recent turns: that's where the model is currently working.
  - Any decisions or constraints that later steps depend on.

The trim-aggressively list:

  - Old tool outputs: usually the bulkiest blocks, and stale after a few turns.
  - Intermediate reasoning that has already been folded into a conclusion.
  - Resolved sub-tasks: a one-line summary beats the full transcript.
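The keep/trim split amounts to a simple policy. A sketch, assuming the same generic message list; the `keep_recent` threshold and the stub text are assumptions to tune per agent:

```python
def trim(messages, keep_recent=6, tool_stub="[tool output trimmed]"):
    # Keep the system prompt and original request (first two blocks) and
    # the most recent turns in full; stub out older tool outputs, which
    # are usually the bulkiest blocks. keep_recent is a tunable assumption.
    if len(messages) <= keep_recent + 2:
        return messages
    head, middle, tail = messages[:2], messages[2:-keep_recent], messages[-keep_recent:]
    trimmed = [
        {**m, "content": tool_stub} if m["role"] == "tool" else m
        for m in middle
    ]
    return head + trimmed + tail
```

Run on every turn, this keeps the head and tail of the session stable while the middle stays cheap.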

A worked example: long research session

After 20 turns, the agent's context is 80K tokens. You can either pay 20× the cost per subsequent turn, or intervene. Here's the intervention:

  1. Insert a summary turn. Replace turns 1-12 with: "Earlier in this session: user asked about X, I found sources A, B, C, key facts [list], decided to pursue angle Y."
  2. Keep turns 13-20 in full. These are where the model is currently working.
  3. Context drops from 80K to ~20K. Cost per turn falls 4×. Quality goes up because "lost in the middle" stops happening.

Do this automatically at a threshold (e.g., whenever context crosses 30K tokens).
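Automating that threshold check might look like the sketch below. The 4-characters-per-token heuristic is an assumption (a real implementation would use the model's tokenizer), and `summarize` is your own call back into the model, passed in rather than assumed:

```python
def estimate_tokens(messages):
    # Crude heuristic: ~4 characters per token. Accurate counting would
    # use the model's own tokenizer; for a trigger threshold this is enough.
    return sum(len(m["content"]) for m in messages) // 4

def maybe_compact(messages, summarize, threshold=30_000, keep_recent=8):
    # Past the threshold, fold everything between the system prompt and
    # the last keep_recent blocks into one summary turn, mirroring the
    # worked example above: old turns collapse, recent turns stay intact.
    if estimate_tokens(messages) < threshold or len(messages) <= keep_recent + 1:
        return messages
    head, old, tail = messages[:1], messages[1:-keep_recent], messages[-keep_recent:]
    summary = {"role": "assistant",
               "content": "Earlier in this session: " + summarize(old)}
    return head + [summary] + tail
```

Call `maybe_compact` before every model call; below the threshold it's a no-op, above it the context snaps back to summary-plus-recent-turns.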

Retrieval blurs the line

The advanced move: store everything externally (a vector store, a scratch file), and retrieve only what the current step needs. Short-term memory becomes a working set, not a log. This is how you get agents that run for hours without collapsing.
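A toy version of that working set, with keyword overlap standing in for a real vector store (the class and method names are illustrative, not a library's API):

```python
class WorkingSetMemory:
    # Sketch of context-as-working-set: every block is archived outside
    # the prompt, and each step rebuilds its context from only the entries
    # relevant to the current query. Keyword overlap stands in here for
    # real embedding similarity.
    def __init__(self):
        self.archive = []

    def store(self, text):
        self.archive.append(text)

    def retrieve(self, query, k=3):
        query_words = set(query.lower().split())
        scored = sorted(
            self.archive,
            key=lambda entry: len(query_words & set(entry.lower().split())),
            reverse=True,
        )
        return scored[:k]
```

Each agent step stores its result and retrieves a handful of entries; the prompt stays small no matter how long the session runs.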

Pitfalls

What to do with this