Short-term memory
Updated 2026-04-19
Short-term memory is everything that's in the model's context window right now. Not some exotic abstraction. Just: the system prompt, the user's question, the conversation so far, the tool calls made, the results that came back. That's what the model reads every time it takes a step. If you don't manage it, long sessions get slow, expensive, and, worst, wrong. The model starts forgetting what it was doing.
What's actually in short-term
Every LLM call sends all of the above, in full. After 15 turns of tool use, that's 30+ blocks of content piling up, since each turn adds at least a tool call and its result. Which is where the problems start.
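A minimal sketch of what that means in code. The message schema here is illustrative, not any particular SDK's format: short-term memory is just the list you rebuild and send on every step.

```python
# Short-term memory is the message list assembled for each call.
# Schema is illustrative (roles/keys are not a real SDK's).

def build_context(system_prompt, user_request, history):
    """Everything the model reads on this step."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_request},
        *history,  # prior turns, tool calls, tool results
    ]

history = []
for turn in range(15):  # 15 turns of tool use...
    history.append({"role": "assistant", "content": f"tool call {turn}"})
    history.append({"role": "tool", "content": f"result {turn}"})

context = build_context("You are a research agent.", "Summarize topic X.", history)
print(len(context))  # 2 fixed blocks + 30 history blocks = 32
```

Every turn appends to `history`, and the whole thing gets re-sent on the next call.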
The long-session collapse
Left unmanaged, short-term memory explodes:
- Cost grows every turn. Every call pays for the full context, so a call with a 50K-token context costs roughly 17× one with a 3K context — and if context grows each turn, cumulative session cost grows quadratically.
- Latency grows. Long contexts take longer to process.
- "Lost in the middle." Models attend less to content in the middle of long contexts. Important info from turn 3 gets ignored by turn 15.
- Reasoning decays. At very long contexts, models get confused about what step they're on and what they're trying to do.
The cure is aggressive management: trim, summarize, or retrieve. Not "just use a longer context window."
Four strategies for keeping it lean
Keep in full, trim the rest
The always-keep list, regardless of session length:
- System prompt (tells the model who it is)
- Original user request (the goal)
- Last 2-3 turns (so replies feel connected)
- Ground-truth artifacts (the doc being edited, the current plan)
- Active tool call results still being reasoned over
The trim-aggressively list:
- Old tool results whose info has been used
- Search results the model skimmed but didn't pick
- Intermediate reasoning from earlier phases
- Completed planning steps
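The two lists above can be sketched as a single filter pass. The `"kind"` tags here are hypothetical labels — a real agent would classify messages however its framework represents them:

```python
# Sketch of the keep/trim policy: keep the always-keep kinds plus the
# last few turns, drop everything else. "kind" tags are hypothetical.

def trim_context(messages, keep_last_turns=3):
    n = len(messages)
    always_keep = {"system", "user_request", "artifact", "active_tool"}
    return [m for i, m in enumerate(messages)
            if m["kind"] in always_keep or i >= n - keep_last_turns]

messages = [
    {"kind": "system",       "content": "You are a research agent."},
    {"kind": "user_request", "content": "Compare X and Y."},
    {"kind": "old_tool",     "content": "stale search results"},
    {"kind": "reasoning",    "content": "early planning notes"},
    {"kind": "artifact",     "content": "the draft being edited"},
    {"kind": "turn",         "content": "turn 18"},
    {"kind": "turn",         "content": "turn 19"},
    {"kind": "turn",         "content": "turn 20"},
]
trimmed = trim_context(messages)  # drops old_tool and reasoning
```

The system prompt, goal, and artifact survive no matter how deep they sit in the list; only stale intermediate material gets cut.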
A worked example: long research session
After 20 turns, the agent's context is 80K tokens. You can either keep paying for all 80K on every subsequent turn, or intervene. Here's the intervention:
- Insert a summary turn. Replace turns 1-12 with: "Earlier in this session: user asked about X, I found sources A, B, C, key facts [list], decided to pursue angle Y."
- Keep turns 13-20 in full. These are where the model is currently working.
- Context drops from 80K to ~20K. Cost per turn falls 4×. Quality goes up because "lost in the middle" stops happening.
Do this automatically at a threshold (e.g., whenever context crosses 30K tokens).
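A hedged sketch of that automatic trigger. Token counts here use a rough 4-characters-per-token estimate, and `summarize` is a placeholder for a real LLM summarization call — both are assumptions, not a prescribed implementation:

```python
# Threshold-triggered compaction: when context crosses THRESHOLD,
# replace old turns with a summary, keep the system prompt and the
# recent turns in full. Token estimate is ~4 chars/token.

THRESHOLD = 30_000  # tokens

def rough_tokens(messages):
    return sum(len(m["content"]) for m in messages) // 4

def maybe_compact(messages, keep_recent=8, summarize=None):
    if rough_tokens(messages) < THRESHOLD:
        return messages  # under budget, leave as-is
    system = messages[0]
    old, recent = messages[1:-keep_recent], messages[-keep_recent:]
    summary = (summarize(old) if summarize
               else f"Earlier in this session: {len(old)} turns compacted.")
    return [system, {"role": "assistant", "content": summary}, *recent]

msgs = [{"role": "system", "content": "agent"}]
msgs += [{"role": "tool", "content": "x" * 8_000} for _ in range(20)]
compacted = maybe_compact(msgs)  # 21 messages -> system + summary + 8 recent
```

Run this check after every turn; it's a no-op until the threshold trips, so there's no cost to calling it unconditionally.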
Retrieval blurs the line
The advanced move: store everything externally (a vector store, a scratch file), and retrieve only what the current step needs. Short-term memory becomes a working set, not a log. This is how you get agents that run for hours without collapsing.
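A toy sketch of the pattern, with naive keyword overlap standing in for embedding similarity — the `ExternalStore` class and its scoring are entirely illustrative:

```python
# Retrieval pattern: log everything externally, rebuild a small
# working set per step. Keyword-overlap scoring is a stand-in for
# real embedding similarity.

class ExternalStore:
    def __init__(self):
        self.entries = []

    def log(self, text):
        self.entries.append(text)  # full log lives here, not in context

    def retrieve(self, query, k=3):
        q = set(query.lower().split())
        scored = sorted(self.entries,
                        key=lambda e: -len(q & set(e.lower().split())))
        return scored[:k]  # only the working set re-enters context

store = ExternalStore()
store.log("source A: battery density figures")
store.log("source B: solar panel costs")
store.log("decided to pursue the battery angle")
working_set = store.retrieve("battery density", k=2)
```

The context window now holds only `working_set`, not the whole log, so it stays flat no matter how long the session runs.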
Pitfalls
- Trimming too aggressively. You dropped a tool result the model still needs. Symptom: the model calls the same tool with the same arguments twice.
- Trimming the system prompt. Don't. It's small, and dropping it breaks everything.
- Summarizing too late. Model already lost track by turn 25; summary at turn 30 can't recover.
- Believing "big context window = no management needed." A 200K window just means you can pay for 200K tokens on every call. And "lost in the middle" happens even at 200K.
What to do with this
- Add a context-size monitor to your agent. When it passes a threshold, insert a summary.
- Read long-term memory for what happens between sessions.
- Read memory system design for how all the memory types compose.