Short-term memory
Updated 2026-04-19
Short-term memory is everything that's in the model's context window right now. Not some exotic abstraction. Just: the system prompt, the user's question, the conversation so far, the tool calls made, the results that came back. That's what the model reads every time it takes a step. If you don't manage it, long sessions get slow, expensive, and, worst, wrong. The model starts forgetting what it was doing.
What's actually in short-term
Every LLM call sends all of the above, in full. After 15 turns of tool use, that's 30+ blocks of content piling up, since each turn adds at least a tool call and its result. Which is where the problems start.
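A minimal sketch of what that means in code. The message schema here is illustrative, not any particular SDK's format: short-term memory is just the list you rebuild and send on every step.

```python
# Short-term memory is the message list assembled for each call.
# Schema is illustrative (roles/keys are not a real SDK's).

def build_context(system_prompt, user_request, history):
    """Everything the model reads on this step."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_request},
        *history,  # prior turns, tool calls, tool results
    ]

history = []
for turn in range(15):  # 15 turns of tool use...
    history.append({"role": "assistant", "content": f"tool call {turn}"})
    history.append({"role": "tool", "content": f"result {turn}"})

context = build_context("You are a research agent.", "Summarize topic X.", history)
print(len(context))  # 2 fixed blocks + 30 history blocks = 32
```

Every turn appends to `history`, and the whole thing gets re-sent on the next call.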
The long-session collapse
Left unmanaged, short-term memory explodes:
- Cost grows every turn. Every call pays for the full context, so a call with a 50K-token context costs roughly 17× one with a 3K context — and if context grows each turn, cumulative session cost grows quadratically.
- Latency grows. Long contexts take longer to process.
- "Lost in the middle." Models attend less to content in the middle of long contexts. Important info from turn 3 gets ignored by turn 15.
- Reasoning decays. At very long contexts, models get confused about what step they're on and what they're trying to do.
The cure is aggressive management: trim, summarize, or retrieve. Not "just use a longer context window."
Four strategies for keeping it lean
Keep in full, trim the rest
The always-keep list, regardless of session length:
- System prompt (tells the model who it is)
- Original user request (the goal)
- Last 2-3 turns (so replies feel connected)
- Ground-truth artifacts (the doc being edited, the current plan)
- Active tool call results still being reasoned over
The trim-aggressively list:
- Old tool results whose info has been used
- Search results the model skimmed but didn't pick
- Intermediate reasoning from earlier phases
- Completed planning steps
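The two lists above can be sketched as a single filter pass. The `"kind"` tags here are hypothetical labels — a real agent would classify messages however its framework represents them:

```python
# Sketch of the keep/trim policy: keep the always-keep kinds plus the
# last few turns, drop everything else. "kind" tags are hypothetical.

def trim_context(messages, keep_last_turns=3):
    n = len(messages)
    always_keep = {"system", "user_request", "artifact", "active_tool"}
    return [m for i, m in enumerate(messages)
            if m["kind"] in always_keep or i >= n - keep_last_turns]

messages = [
    {"kind": "system",       "content": "You are a research agent."},
    {"kind": "user_request", "content": "Compare X and Y."},
    {"kind": "old_tool",     "content": "stale search results"},
    {"kind": "reasoning",    "content": "early planning notes"},
    {"kind": "artifact",     "content": "the draft being edited"},
    {"kind": "turn",         "content": "turn 18"},
    {"kind": "turn",         "content": "turn 19"},
    {"kind": "turn",         "content": "turn 20"},
]
trimmed = trim_context(messages)  # drops old_tool and reasoning
```

The system prompt, goal, and artifact survive no matter how deep they sit in the list; only stale intermediate material gets cut.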
A worked example: long research session
After 20 turns, the agent's context is 80K tokens. You can either keep paying for all 80K on every subsequent turn, or intervene. Here's the intervention:
- Insert a summary turn. Replace turns 1-12 with: "Earlier in this session: user asked about X, I found sources A, B, C, key facts [list], decided to pursue angle Y."
- Keep turns 13-20 in full. These are where the model is currently working.
- Context drops from 80K to ~20K. Cost per turn falls 4×. Quality goes up because "lost in the middle" stops happening.
Do this automatically at a threshold (e.g., whenever context crosses 30K tokens).
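A hedged sketch of that automatic trigger. Token counts here use a rough 4-characters-per-token estimate, and `summarize` is a placeholder for a real LLM summarization call — both are assumptions, not a prescribed implementation:

```python
# Threshold-triggered compaction: when context crosses THRESHOLD,
# replace old turns with a summary, keep the system prompt and the
# recent turns in full. Token estimate is ~4 chars/token.

THRESHOLD = 30_000  # tokens

def rough_tokens(messages):
    return sum(len(m["content"]) for m in messages) // 4

def maybe_compact(messages, keep_recent=8, summarize=None):
    if rough_tokens(messages) < THRESHOLD:
        return messages  # under budget, leave as-is
    system = messages[0]
    old, recent = messages[1:-keep_recent], messages[-keep_recent:]
    summary = (summarize(old) if summarize
               else f"Earlier in this session: {len(old)} turns compacted.")
    return [system, {"role": "assistant", "content": summary}, *recent]

msgs = [{"role": "system", "content": "agent"}]
msgs += [{"role": "tool", "content": "x" * 8_000} for _ in range(20)]
compacted = maybe_compact(msgs)  # 21 messages -> system + summary + 8 recent
```

Run this check after every turn; it's a no-op until the threshold trips, so there's no cost to calling it unconditionally.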
Retrieval blurs the line
The advanced move: store everything externally (a vector store, a scratch file), and retrieve only what the current step needs. Short-term memory becomes a working set, not a log. This is how you get agents that run for hours without collapsing.
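A toy sketch of the pattern, with naive keyword overlap standing in for embedding similarity — the `ExternalStore` class and its scoring are entirely illustrative:

```python
# Retrieval pattern: log everything externally, rebuild a small
# working set per step. Keyword-overlap scoring is a stand-in for
# real embedding similarity.

class ExternalStore:
    def __init__(self):
        self.entries = []

    def log(self, text):
        self.entries.append(text)  # full log lives here, not in context

    def retrieve(self, query, k=3):
        q = set(query.lower().split())
        scored = sorted(self.entries,
                        key=lambda e: -len(q & set(e.lower().split())))
        return scored[:k]  # only the working set re-enters context

store = ExternalStore()
store.log("source A: battery density figures")
store.log("source B: solar panel costs")
store.log("decided to pursue the battery angle")
working_set = store.retrieve("battery density", k=2)
```

The context window now holds only `working_set`, not the whole log, so it stays flat no matter how long the session runs.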
Pitfalls
- Trimming too aggressively. You dropped a tool result the model still needs. Symptom: the model calls the same tool with the same arguments twice.
- Trimming the system prompt. Don't. It's small, and dropping it breaks everything.
- Summarizing too late. Model already lost track by turn 25; summary at turn 30 can't recover.
- Believing "big context window = no management needed." A 200K window just means you can pay for 200K tokens on every call. And "lost in the middle" happens even at 200K.
What to do with this
- Add a context-size monitor to your agent. When it passes a threshold, insert a summary.
- Read long-term memory for what happens between sessions.
- Read memory system design for how all the memory types compose.