AI Agent Memory Systems: How Long-Term Memory Architectures Keep LLM Agents Coherent
Last Updated: April 18, 2026
Claude’s prompt caching and GPT-5’s native memory features went viral last month—but most developers building agents still don’t understand the internals. When an agent suddenly forgets critical context mid-task or hallucinates past interactions, it’s not the LLM failing: it’s the memory layer collapsing. This post cuts through the hype and shows you exactly how agent memory works, why it matters, and how to architect it for production coherence.
TL;DR
AI agents need memory because LLM context windows are finite. Long-term memory stacks four tiers—working memory (active context), episodic memory (timestamped events), semantic memory (facts), and procedural memory (tools/schemas)—each retrieved differently. MemGPT-style virtual context management pages data in/out; vector stores enable efficient semantic search; consolidation loops compress old events into facts. Production systems combine HNSW indexing, importance scoring, and time-decay to prevent staleness and retrieval collapse. Implementations range from simple RAG (retrieval-augmented generation) to full memory graphs; choosing the right architecture depends on agent complexity, latency budget, and user isolation requirements.
Table of Contents
- Key Concepts Before We Begin
- How AI Agent Memory Works: System Overview
- The Four Memory Tiers: Working, Episodic, Semantic, Procedural
- MemGPT-Style Virtual Context Management
- Vector Memory Retrieval: From Embedding to Answer
- Memory Consolidation and Forgetting
- Benchmarks & Comparison
- Edge Cases & Failure Modes
- Implementation Guide: Building a Memory Layer for Your Agent
- Frequently Asked Questions
- Where Agent Memory Is Heading
- References & Further Reading
- Related Posts
Key Concepts Before We Begin
Before diving into architectures, let’s define the foundational terms. These concepts appear throughout the post and map cleanly to real storage and retrieval patterns you’ll build.
Working Memory (Context Window)
The active, in-memory buffer the LLM reads from on each forward pass. In Claude’s case, this is your token budget (e.g., 200K tokens). Think of it as the agent’s “notepad”—everything relevant to the current decision must fit here. Critically, working memory is finite and expensive, so every design choice downstream exists to keep the most relevant items in this space.
Episodic Memory
Timestamped records of events, conversations, and interactions. “User asked about API rate limits at 2:34 PM on April 15.” Episodic memory is chronological and searchable by time or content, but grows unbounded. Retrieval is typically triggered by temporal queries (“what happened yesterday?”) or semantic similarity (“did we talk about pricing?”).
Semantic Memory
Facts, rules, and general knowledge extracted from episodic events. “API rate limit is 100 requests per minute” is semantic. Unlike episodic memory, semantic facts have no timestamp—they’re treated as persistent truths. Retrieval uses similarity search (vector embeddings) or keyword matching.
Procedural Memory
Tool schemas, function definitions, and action patterns. This is the agent’s “skills”—the set of operations it can perform. Procedural memory rarely changes per session and is retrieved by exact match or schema similarity when deciding what tools to invoke.
Context Window
The token capacity of the underlying LLM. GPT-4 Turbo offers a 128K-token context; Claude 3.5 Sonnet offers 200K. This is the hard limit on working memory size. Every token used for memory is a token not available for reasoning or generating the response.
RAG (Retrieval-Augmented Generation)
The broad pattern: encode a query, search a knowledge base, and inject top results into the LLM prompt. RAG solves the knowledge cutoff problem but doesn’t natively handle temporal memory or multi-turn coherence. It’s a building block, not a complete memory solution.
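The retrieve-then-inject pattern can be sketched in a few lines. This is a toy illustration: keyword overlap stands in for embedding similarity, and the knowledge base is a plain list rather than a vector store.

```python
# Minimal RAG sketch: score documents against the query, take the top-k,
# and build an augmented prompt. Keyword overlap is a stand-in for
# embedding similarity (a real system would embed and use ANN search).

def score(query: str, doc: str) -> float:
    """Toy relevance score: fraction of query words present in the doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / max(len(q_words), 1)

def retrieve(query: str, knowledge_base: list[str], top_k: int = 2) -> list[str]:
    ranked = sorted(knowledge_base, key=lambda d: score(query, d), reverse=True)
    return ranked[:top_k]

def build_prompt(query: str, knowledge_base: list[str]) -> str:
    context = "\n".join(retrieve(query, knowledge_base))
    return f"Context:\n{context}\n\nQuestion: {query}"

kb = [
    "The API rate limit is 100 requests per minute.",
    "Authentication uses JWT bearer tokens.",
    "The office cafeteria opens at 8 AM.",
]
prompt = build_prompt("What is the API rate limit?", kb)
```

Note that even this toy version exhibits RAG's limitation: it answers from whatever matches the query, with no notion of when something was said or whether it still holds.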
Vector Store
A searchable database of embeddings, indexed for fast approximate nearest-neighbor (ANN) queries. Common implementations include Pinecone, Weaviate, Qdrant, and Milvus. A vector store can hold millions of embeddings and return results in milliseconds. The trade-off: embeddings are lossy, so reranking and filtering are essential.
MemGPT
A pioneering 2023 system (Packer et al., arXiv:2310.08560) that introduced virtual context management: the agent treats a limited working memory as a “main memory” and a larger external store as “disk.” Function calls enable paging in/out of context. This unlocked truly long-horizon agent tasks (100K+ token sequences).
Summary Buffer
A memory consolidation technique: periodically summarize old conversation turns to reclaim tokens. Example: “User and agent discussed 10 API endpoints over 50 turns. Key facts: rate limit is 100 req/min, authentication uses JWT, endpoints live at api.example.com.” Summaries compress 50 turns into a few tokens.
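A summary buffer can be sketched as a class that folds the oldest turns into a running summary once the log exceeds a budget. The `summarize` function here is a hypothetical placeholder for an LLM summarization call.

```python
# Summary-buffer sketch: when the turn log exceeds max_turns, compress the
# oldest turns into the running summary and keep only the most recent ones.

def summarize(summary: str, turns: list[str]) -> str:
    # Placeholder: a real system would call an LLM to summarize `turns`.
    return summary + f" | {len(turns)} turns compressed"

class SummaryBuffer:
    def __init__(self, max_turns: int = 4, keep_recent: int = 2):
        self.max_turns = max_turns
        self.keep_recent = keep_recent
        self.summary = "Conversation summary:"
        self.turns: list[str] = []

    def add_turn(self, turn: str) -> None:
        self.turns.append(turn)
        if len(self.turns) > self.max_turns:
            # Fold everything except the most recent turns into the summary.
            old = self.turns[:-self.keep_recent]
            self.turns = self.turns[-self.keep_recent:]
            self.summary = summarize(self.summary, old)

    def prompt_context(self) -> str:
        return self.summary + "\n" + "\n".join(self.turns)

buf = SummaryBuffer()
for i in range(6):
    buf.add_turn(f"turn {i}")
```

The prompt then carries one compact summary plus verbatim recent turns, trading fidelity on old turns for reclaimed tokens.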
How AI Agent Memory Works: System Overview
At its core, agent memory solves a single problem: the LLM has a fixed context window, but agents must operate over unbounded task horizons. Without memory, an agent operating for hours or days would lose track of what it learned, contradict its own previous answers, and eventually hallucinate past interactions that never occurred.
The solution is a layered retrieval system: when the agent needs context, it queries its memory layers, retrieves the most relevant information, and injects it into the prompt. The system continuously updates memory as new events occur, consolidating old events into facts and pruning low-signal data. This cycle repeats across every agent turn.

The flow works as follows:
- User Input & Current Task arrives. The agent encodes it into its working memory.
- Working Memory Buffer holds the immediate context—the last few turns, the current goal, active tool calls. This is your immediate scope.
- Retrieval Trigger: If the agent senses it needs historical context (“Have we discussed this before?”), it encodes the query and searches its memory layers.
- Vector Store Lookup: The episodic and semantic stores are indexed by embedding. A fast ANN search returns candidate memories.
- Retrieved Context is formatted and injected into the LLM prompt, augmenting the working memory.
- LLM Forward Pass consumes both working and retrieved context to decide on the next action.
- Memory Update: After the action, new events (tool calls, results, user feedback) are stored as episodic entries.
- Consolidation Loop (background): Old episodic entries are scored, summarized into semantic facts, and pruned. This keeps the external store manageable.
This cycle is designed to be transparent to the agent. Crucially, the agent doesn’t explicitly decide what to remember—the retrieval and consolidation layers make those decisions based on access frequency, importance, and age.
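The per-turn cycle above can be condensed into a skeleton. Everything here is a trivial in-memory stand-in (a word-overlap “embedding,” a list as the store, a canned response for the LLM); the names are illustrative, not a real API.

```python
# Skeleton of the per-turn memory cycle: encode, retrieve, inject, act, store.

def embed(text: str) -> set:
    return set(text.lower().split())  # toy "embedding": a word set

def search_store(store: list, query_vec: set, top_k: int = 2) -> list:
    # ANN stand-in: rank stored events by word overlap with the query.
    return sorted(store, key=lambda e: len(embed(e) & query_vec),
                  reverse=True)[:top_k]

def agent_turn(user_input: str, store: list) -> str:
    query_vec = embed(user_input)                        # retrieval trigger
    retrieved = search_store(store, query_vec)           # vector-store lookup
    prompt = "\n".join(retrieved) + "\nUser: " + user_input  # inject context
    response = f"(answer based on {len(retrieved)} memories)"  # LLM stand-in
    store.append(f"user said: {user_input}")             # memory update
    return response

store = ["user asked about pricing", "user prefers CSV exports"]
reply = agent_turn("what about pricing?", store)
```

The consolidation loop would run as a separate background pass over `store`; it is omitted here for brevity.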
The Four Memory Tiers: Working, Episodic, Semantic, Procedural
Not all memory is created equal. Different types of information need different storage, retrieval, and eviction strategies. Production systems use a four-tier model, each optimized for its role.

Working Memory (Tier 1: Hot, In-Process)
Working memory is the LLM’s immediate input. It includes:
– Current user query
– Last 3–5 conversation turns
– Active tool state (e.g., “waiting for database query to complete”)
– Current goal and subgoals
Size: 1K–20K tokens, depending on complexity.
Retrieval: None—it’s always available.
Eviction: Manual (the developer decides what to include in the next prompt).
Failure mode: Overflow causes relevant context to be truncated.
Example: “User asked: ‘Summarize my Q1 spending.’ Agent is querying the expense database. Current step: filtering by category.”
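The overflow failure mode can be mitigated with a newest-first trimming pass: walk the turns from most recent backwards, keeping whatever fits the budget. This sketch uses a crude word count in place of a real tokenizer.

```python
# Working-memory trimming sketch: keep the newest turns that fit the token
# budget, dropping the oldest first.

def count_tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for a real tokenizer

def fit_to_budget(turns: list[str], budget: int) -> list[str]:
    kept, used = [], 0
    for turn in reversed(turns):          # newest first
        cost = count_tokens(turn)
        if used + cost > budget:
            break                         # oldest turns fall off
        kept.append(turn)
        used += cost
    return list(reversed(kept))           # restore chronological order

turns = ["a b c d", "e f g", "h i", "j"]
kept = fit_to_budget(turns, budget=6)
```

A production version would trim on real token counts and protect pinned items (the current goal, active tool state) from eviction.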
Episodic Memory (Tier 2: Warm, Searchable)
Episodic memory is a log of timestamped events. It includes:
– Conversation turns (user message + agent response)
– Tool calls and results
– User feedback and corrections
– Intermediate reasoning steps
Size: Tens of thousands of events (100K–10M tokens uncompressed).
Retrieval: Semantic search (similarity) or temporal range queries (“all events from the last 7 days”).
Eviction: Time-based (delete events older than N days) or importance-based (keep high-signal events, prune low-signal ones).
Failure mode: Staleness (retrieving outdated information) or explosion (unbounded storage costs).
Example episodic entry:
{
  "timestamp": "2026-04-18T14:32:00Z",
  "type": "tool_call",
  "tool": "query_expense_database",
  "input": { "start_date": "2026-01-01", "end_date": "2026-03-31" },
  "result": "1247 expenses found; total $14,523.44",
  "embedding": [0.12, -0.45, ..., 0.78]  # Stored for retrieval
}
Semantic Memory (Tier 3: Warm, Fact Store)
Semantic memory holds extracted facts and general knowledge. It includes:
– Consolidated facts (“User’s annual salary is $120K”)
– Domain rules (“Approval needed for expenses > $5K”)
– User preferences (“User prefers CSV exports”)
– Few-shot examples used for in-context learning.
Size: Thousands of facts (10K–100K tokens).
Retrieval: Similarity search (vectors) or keyword/tag matching.
Eviction: Overwrite if new information contradicts (last-write wins), or version tracking for accountability.
Failure mode: Hallucinated facts (the consolidation process introduces errors) or staleness (contradicted by new information but not updated).
Example semantic facts:
Fact ID: user_sal_001
Content: "User's annual salary is $120,000"
Source: Episodic event (user statement on 2026-01-15)
Confidence: High
Last Updated: 2026-01-15
Fact ID: expense_policy_001
Content: "Expenses > $5,000 require manager approval"
Source: Organization policy (not user-provided)
Confidence: High
Last Updated: 2026-04-01
Procedural Memory (Tier 4: Cold, Schema Registry)
Procedural memory defines what the agent can do. It includes:
– Tool definitions (name, parameters, description)
– API schemas and endpoints
– Action patterns and decision trees.
Size: Kilobytes to hundreds of KB (tool definitions are verbose but finite).
Retrieval: Exact match (tool name) or schema similarity (finding analogous tools).
Eviction: Never—these are static or change only with system updates.
Failure mode: Out-of-date schemas (the API changed but the agent’s memory didn’t).
Example procedural entry:
{
"tool_id": "query_expenses",
"name": "query_expense_database",
"description": "Query the user's expense database.",
"input_schema": {
"start_date": { "type": "string", "format": "date" },
"end_date": { "type": "string", "format": "date" },
"category": { "type": "string", "enum": ["food", "travel", "software", ...] }
},
"output_schema": {
"expenses": [{ "date": "string", "amount": "number", "category": "string" }]
}
}
Cross-Tier Interactions
These tiers interact tightly. When the agent decides to make a tool call, it retrieves the tool definition from procedural memory (Tier 4), uses semantically similar facts from semantic memory (Tier 3) to fill in parameters, and may reference recent episodic events (Tier 2) for context. The resulting action is added as a new episodic entry. If the action produces a significant fact (e.g., “user’s manager approved this expense”), a consolidation loop may extract it into semantic memory.
MemGPT-Style Virtual Context Management
The seminal breakthrough in agent long-term memory came from Packer et al.’s MemGPT paper (2023). The core insight: treat the LLM’s context window as main memory and an external store as virtual memory. When context fills up, the agent pages data out; when it needs data back, it pages in. This is borrowed from operating systems, but applied to prompts.

The Paging Model
MemGPT divides an agent’s memory into:
- Core Context: The portion the agent actively uses. Size: 2K–8K tokens. Includes current goal, recent turns, active tool state.
- Main Memory: An extended in-context buffer, refreshed each turn. Size: 4K–32K tokens. Holds summaries, facts, recent episodic events.
- External Memory: Vector-indexed store on disk/database. Size: Unbounded. Holds raw episodic events, archives.
On each turn:
- Agent observes its goal and main memory.
- If main memory fills beyond a threshold (e.g., 80%), it invokes a save_memory() function call that passes out-of-date entries to the external store.
- Consolidation logic summarizes the outgoing entries (e.g., compressing 10 conversation turns into a 50-token summary).
- The summary is stored in external memory with an embedding.
- Core context is refreshed with the next most relevant entries from main memory.
Eviction Policies
Which entries get paged out? MemGPT uses several strategies:
- LRU (Least Recently Used): Page out entries not accessed for the longest.
- Importance Scoring: Assign each episodic entry a score based on information density, user-relevance, and frequency of mention. Page out low-scoring entries.
- Temporal: Page out entries older than a threshold (e.g., “events older than 1 day”).
- Hybrid: Combine signals (e.g., “if entry is old AND low importance, page it out; if recent and high importance, keep it”).
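The hybrid policy can be expressed as a simple predicate over age and importance. The thresholds below are illustrative, not values from the MemGPT paper.

```python
# Hybrid eviction sketch: page an entry out only when it is both old and
# low-importance; recent or high-importance entries stay in main memory.

from datetime import datetime, timedelta

def should_evict(entry: dict, now: datetime,
                 max_age: timedelta = timedelta(days=1),
                 min_importance: float = 0.3) -> bool:
    too_old = (now - entry["last_access"]) > max_age
    low_importance = entry["importance"] < min_importance
    return too_old and low_importance

now = datetime(2026, 4, 18)
entries = [
    {"id": "a", "last_access": now - timedelta(days=3), "importance": 0.1},
    {"id": "b", "last_access": now - timedelta(days=3), "importance": 0.9},
    {"id": "c", "last_access": now, "importance": 0.1},
]
evicted = [e["id"] for e in entries if should_evict(e, now)]
```

Entry "b" survives despite its age (high importance) and "c" survives despite its low score (recent), which is exactly the behavior the hybrid rule is after.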
Retrieval on Demand
When the agent needs information not in main memory, it issues a search_memory() function call with a query. The system:
- Embeds the query.
- Searches external memory with ANN (approximate nearest neighbor).
- Ranks results by relevance (cross-encoder reranking).
- Pages the top-K results back into main memory, possibly paging out low-priority existing entries.
Why It Works
MemGPT-style paging unlocked truly long-horizon agent tasks. In the original paper, agents using MemGPT completed tasks involving 100K+ tokens of context—far beyond the base model’s capacity. The trick: the agent can reason about its own memory. It can call save_memory() proactively when it knows it won’t need data soon, and call search_memory() when it realizes it forgot something.
Vector Memory Retrieval: From Embedding to Answer
Vector retrieval is the engine powering episodic and semantic memory queries. When an agent asks “Did we discuss API pricing?”, a vector retrieval system finds similar past interactions in milliseconds.

The Steps
- Query Embedding: The agent’s question is encoded into a dense vector. Modern embedders (e.g., OpenAI’s text-embedding-3 models, Voyage AI’s models) map text into a high-dimensional space (often 1,536 dimensions) where texts with similar meaning land near each other geometrically.
- Vector Store Lookup: The query vector is compared against a pre-indexed collection of episodic/semantic memory vectors. This uses an approximate nearest-neighbor (ANN) algorithm, not brute-force distance comparison.
- ANN Search Algorithms:
– HNSW (Hierarchical Navigable Small World): Default for many systems. Builds a multi-layer navigable graph. Query complexity: O(log n). Memory overhead: ~8x the data size. Excellent for dense, high-recall scenarios.
– IVF (Inverted File Index): Cluster-based approach. Divides vectors into clusters and searches only relevant clusters. Memory: ~1x. Trade-off: lower recall if clusters are misaligned.
- Candidate Retrieval: Top-K candidates (e.g., the top 10 episodic events) are returned. These are approximate nearest neighbors, not exact.
- Reranking: A cross-encoder model (e.g., BAAI’s bge-reranker-large) re-scores the top-K candidates by reading the full query and each candidate’s full text together. This is more expensive than embedding comparison but significantly improves precision, often recovering 5–15% precision over the embedding-only ranking.
- Filtering & Deduplication: Remove near-duplicates (cosine similarity > 0.95) and apply metadata filters (e.g., “only events from the last 30 days”).
- Context Formatting: Results are formatted into a prompt-friendly structure:
```
## Memory: Related Past Interactions
[2026-04-15 14:30] User: “What’s your API rate limit?”
Agent: “The rate limit is 100 requests per minute.”
[2026-04-12 09:15] User: “Can I increase my rate limit?”
Agent: “Rate limit increases require contacting support.”
```
- Injection into Prompt: The formatted context is prepended or appended to the working memory, and the agent’s LLM forward pass consumes both.
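The near-duplicate filtering step can be implemented with plain cosine similarity: keep a candidate only if it is not too similar to anything already kept.

```python
# Near-duplicate filtering sketch: greedily keep candidates whose cosine
# similarity to every already-kept vector is at or below the threshold.

import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def deduplicate(candidates: list[list[float]],
                threshold: float = 0.95) -> list[list[float]]:
    kept: list[list[float]] = []
    for vec in candidates:
        if all(cosine(vec, k) <= threshold for k in kept):
            kept.append(vec)
    return kept

vecs = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
unique = deduplicate(vecs)  # the second vector is nearly identical to the first
```

Because candidates arrive ranked by relevance, this greedy order keeps the higher-ranked member of each duplicate pair, which is usually what you want.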
Embedding Quality & Relevance
The quality of vector retrieval hinges on embeddings. Poor embeddings lead to retrieval collapse: relevant memories aren’t found because the query vector lands far from their vectors, despite semantic similarity. To mitigate:
- Use domain-specific or fine-tuned embedders (e.g., Voyage AI’s domain-tuned models) rather than defaulting to a general-purpose model; specialized vocabularies are where generic embeddings degrade first.
- Chunk episodic events carefully. One event per vector (vs. concatenating multiple events) improves recall.
- Hybrid search: combine vector similarity with keyword/BM25 ranking. If both signals agree, confidence rises.
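A common way to combine the two signals is a weighted sum of normalized scores. The score dictionaries below are illustrative inputs; in practice the vector store and a BM25 index would produce them.

```python
# Hybrid-search sketch: blend vector-similarity scores with keyword (BM25)
# scores via a weighted sum, then rank documents by the blended score.

def hybrid_rank(vector_scores: dict, keyword_scores: dict,
                alpha: float = 0.6) -> list:
    ids = set(vector_scores) | set(keyword_scores)
    blended = {
        doc_id: alpha * vector_scores.get(doc_id, 0.0)
              + (1 - alpha) * keyword_scores.get(doc_id, 0.0)
        for doc_id in ids
    }
    return sorted(blended, key=blended.get, reverse=True)

vector_scores = {"doc_a": 0.9, "doc_b": 0.4}
keyword_scores = {"doc_b": 0.95, "doc_c": 0.8}
ranking = hybrid_rank(vector_scores, keyword_scores)
```

Here doc_b wins the blend because both signals contribute, even though neither ranks it first alone; alpha tunes how much weight the vector signal gets.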
Latency & Throughput
Vector retrieval latency is critical for interactive agents. Breakdown:
– Embedding query: 10–50 ms
– ANN search: 5–20 ms
– Reranking (top 10): 50–200 ms
– Total: 65–270 ms
For agents making decisions every few seconds, 200 ms is acceptable. But for real-time / sub-100ms agents, reranking must be skipped or done asynchronously.
Memory Consolidation and Forgetting
An unbounded episodic store eventually becomes unusable: retrieval becomes slower (larger indices), storage costs explode, and stale information pollutes results. The solution is consolidation: periodically compress old episodic events into semantic facts, then delete the raw events.

The Consolidation Process
- Trigger: Consolidation runs on a schedule (e.g., every 100 new episodic events) or when storage hits a threshold.
- Candidate Selection: Identify episodic events to consolidate. Candidates are typically:
– Older than N days (e.g., 30 days)
– Low recency-weighted score (rarely accessed recently)
– High aggregate importance (collectively cover high-signal information)
- Summarization: A summarization function (could be an LLM or a deterministic rule) compresses multiple episodic events into fewer semantic facts. Example:
Raw Episodic Events:
```
[Turn 1] User: “What’s our cloud storage cost?”
Agent: “I’ll query the billing database.”
[Turn 2] User: “Any discounts for long-term contracts?”
Agent: “Yes, 20% discount for 3-year commitments.”
[Turn 3] User: “What’s the contract term?”
Agent: “3 years. Cost: $12K/year with the discount.”
```
Consolidated Semantic Fact:
Fact: "User inquired about cloud storage pricing.
Key info: $12K/year with 20% discount for 3-year contract.
Consolidated from turns 1-3 on 2026-04-15."
- Embedding & Storage: The new semantic fact is embedded and stored in the semantic memory layer with metadata (source date range, confidence score).
- Deletion: The original episodic events are marked for deletion (or archived to cold storage).
Importance Scoring
Which events are “important” enough to preserve? Scoring heuristics include:
- Information Density: How much novel information is in the event? A turn where the user reveals a critical preference is high-scoring; a turn where they repeat themselves is low-scoring.
- Recency Weighting: Recent events score higher (exponential or polynomial decay over time).
- Mention Frequency: If the agent references an event in many subsequent turns, it’s important.
- User Feedback: Explicit user corrections or emphasis (“Remember this!”) boost scores.
Example scoring formula:
importance_score =
0.3 * information_density +
0.3 * (1 - time_decay(age)) +
0.2 * mention_frequency +
0.2 * user_emphasis_signal
# Prune if score < threshold (e.g., 0.2)
Time-Decay & Staleness
Memories fade. A fact that was true 6 months ago might no longer apply. Consolidation handles this via:
- Explicit Versioning: Store facts with valid_from / valid_to dates. “Pricing was $10K/year as of 2025-01-01; updated to $12K/year on 2026-01-15.”
- Confidence Decay: Assign a confidence score that decreases with age. At retrieval time, filter out facts below a confidence threshold.
- Active Contradiction: If new information contradicts a stored fact, mark the old fact as superseded.
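Confidence decay can be implemented with an exponential half-life and a retrieval-time threshold. The 90-day half-life and 0.5 threshold below are illustrative choices, not established defaults.

```python
# Confidence-decay sketch: exponentially decay a fact's confidence with age
# and drop facts that fall below a threshold at retrieval time.

import math
from datetime import datetime, timedelta

def decayed_confidence(base: float, age_days: float,
                       half_life_days: float = 90) -> float:
    return base * math.pow(0.5, age_days / half_life_days)

def retrievable(facts: list[dict], now: datetime,
                threshold: float = 0.5) -> list[str]:
    result = []
    for fact in facts:
        age = (now - fact["last_updated"]).days
        if decayed_confidence(fact["confidence"], age) >= threshold:
            result.append(fact["id"])
    return result

now = datetime(2026, 4, 18)
facts = [
    {"id": "fresh", "confidence": 0.9, "last_updated": now - timedelta(days=10)},
    {"id": "stale", "confidence": 0.9, "last_updated": now - timedelta(days=365)},
]
alive = retrievable(facts, now)
```

Facts that decay below the threshold aren't deleted; they simply stop being retrieved, which pairs naturally with explicit versioning for auditability.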
Benchmarks & Comparison
To choose a memory architecture, you need to compare real systems. Here’s a high-level matrix of popular approaches:
| System | Storage Model | Retrieval Mechanism | Consolidation | Best For |
|---|---|---|---|---|
| MemGPT (standalone) | Episodic (timestamped) + Semantic (summaries) | Vector ANN + manual search_memory() calls | LLM-based summarization | Long-horizon multi-turn agents |
| LangGraph Memory (LangChain) | Conversation buffer (rolling window) | Implicit (all recent turns) | Summary buffer (truncate old turns) | Lightweight chat agents |
| Letta | Episodic + Semantic + Procedural | Vector retrieval + importance scoring | Automated consolidation loop | Production-grade agent frameworks |
| Zep | Episodic (message log) + Semantic (facts) | Hybrid (BM25 + vector) | Automatic time-decay + summarization | Multi-agent systems with shared context |
| OpenAI Assistants Memory | Proprietary (managed service) | Implicit retrieval (not exposed) | Managed by OpenAI | Simple chatbots, low control |
| Anthropic Claude Memory (2026) | Hierarchical (native + external) | Prompt caching + vector retrieval | Native consolidation | Native Anthropic integrations |
Key Trade-Offs:
- Abstraction vs. Control: MemGPT and standalone Letta expose memory internals; OpenAI hides them. More control means more work but greater flexibility.
- Consolidation: LangGraph relies on truncation (lose information); MemGPT/Letta/Zep use summarization (compress information).
- Vector DB Dependency: All modern systems except LangGraph’s basic buffer depend on a vector store. This adds infrastructure but improves recall.
- Cost: Managed services (OpenAI, Anthropic) bundle memory into API cost; self-hosted systems (Letta, Zep) require infrastructure but offer lower per-token costs at scale.
Edge Cases & Failure Modes
Real-world agent memory systems fail in predictable ways. Understanding these modes helps you design resilient systems.
Memory Poisoning
An adversary injects false information into episodic memory, which gets consolidated into semantic facts. The agent then bases decisions on lies.
Mitigation:
– Confidence scoring: Only consolidate facts with high-confidence source events.
– Source tracking: Store the original episodic event ID with each semantic fact. If the episodic event is later flagged as false, downweight the fact.
– User verification: For critical decisions, require the agent to confirm facts with the user before relying on them.
Retrieval Collapse
The vector embedding space degrades over time. Similar queries now produce dissimilar vector distances, so retrieval fails. This happens when:
– Training data distribution shifts (new query types unseen by the embedder).
– Insufficient reranking (low embedding precision propagates to results).
– Eviction policy is too aggressive (the information needed is gone).
Mitigation:
– Periodic reindexing: Refresh embedding model and re-embed all data.
– Ensemble retrieval: Combine multiple ANN indices or use hybrid search (vector + keyword).
– Conservative retention: Keep recent data longer; raise eviction thresholds.
Staleness & Contradiction
Old semantic facts become outdated. An agent retrieves “User prefers email over Slack” from 2025, but the user switched to Slack-only in 2026.
Mitigation:
– Versioning: Store facts with valid date ranges.
– Explicit updates: When the agent detects contradictory information, flag old facts as superseded.
– User corrections: Provide a mechanism for users to correct stored facts (“I changed my preference”).
Hallucinated Recall
The agent reports a past interaction that never occurred. This happens when the LLM, given retrieved context, confabulates additional details that weren’t in the original episodic entry.
Mitigation:
– Exact quotes: Store and retrieve verbatim text, not summaries or paraphrases.
– Grounding checks: Ask the agent to cite the exact episodic event ID for factual claims.
– User verification: Show the user the retrieved context and ask “Is this what you meant?”
Context Overflow
After consolidation and retrieval, the injected memory + working memory exceeds the LLM’s context window, causing truncation.
Mitigation:
– Adaptive retrieval: Retrieve fewer results if context is already full.
– Importance-weighted truncation: If truncation is necessary, cut low-importance items first.
– Hierarchical summarization: Instead of injecting full episodic events, inject a summary, then allow drilling-down on demand.
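Importance-weighted truncation can be sketched as a greedy pass over items sorted by importance, keeping whatever fits the remaining budget. Token counts here are illustrative word counts.

```python
# Importance-weighted truncation sketch: when injected memories would
# overflow the remaining context budget, keep the highest-importance
# items that fit and drop the rest.

def truncate_by_importance(items: list[dict], budget: int) -> list[dict]:
    kept, used = [], 0
    for item in sorted(items, key=lambda i: i["importance"], reverse=True):
        cost = len(item["text"].split())  # crude token-count stand-in
        if used + cost <= budget:
            kept.append(item)
            used += cost
    return kept

items = [
    {"text": "low signal filler filler filler", "importance": 0.1},
    {"text": "rate limit is 100 rpm", "importance": 0.9},
    {"text": "user prefers CSV", "importance": 0.7},
]
kept = truncate_by_importance(items, budget=8)
```

Unlike naive tail truncation, this drops the low-signal item first regardless of its position in the retrieved list.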
Implementation Guide: Building a Memory Layer for Your Agent
Here’s a step-by-step guide to implementing a basic agent memory system. We’ll build episodic + semantic memory with vector retrieval.
Prerequisites:
– Python 3.9+
– A vector store (Pinecone, Weaviate, Qdrant, or local HNSW)
– An LLM (OpenAI, Anthropic, or local)
– An embedding model (OpenAI ada-002, Anthropic, or Sentence Transformers)
Step 1: Define Your Episodic Event Schema
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, List, Optional
import uuid

@dataclass
class EpisodicEvent:
    event_id: str
    timestamp: datetime
    event_type: str  # "user_message", "tool_call", "tool_result", "feedback"
    content: str
    metadata: Dict[str, Any]
    embedding: Optional[List[float]] = None

    def to_dict(self):
        return {
            "event_id": self.event_id,
            "timestamp": self.timestamp.isoformat(),
            "event_type": self.event_type,
            "content": self.content,
            "metadata": self.metadata,
            "embedding": self.embedding,
        }

@dataclass
class SemanticFact:
    fact_id: str
    content: str
    source_event_ids: List[str]
    created_at: datetime
    confidence: float  # 0.0 to 1.0
    embedding: Optional[List[float]] = None
    valid_from: Optional[datetime] = None
    valid_to: Optional[datetime] = None

    def to_dict(self):
        return {
            "fact_id": self.fact_id,
            "content": self.content,
            "source_event_ids": self.source_event_ids,
            "created_at": self.created_at.isoformat(),
            "confidence": self.confidence,
            "embedding": self.embedding,
            "valid_from": self.valid_from.isoformat() if self.valid_from else None,
            "valid_to": self.valid_to.isoformat() if self.valid_to else None,
        }
Step 2: Implement the Vector Store Wrapper
import pinecone  # Example: using Pinecone

class MemoryVectorStore:
    def __init__(self, index_name: str, dimension: int = 1536):
        self.index = pinecone.Index(index_name)
        self.dimension = dimension

    def upsert_episodic(self, event: EpisodicEvent):
        """Store or update an episodic event."""
        if event.embedding is None:
            raise ValueError("Event must have embedding")
        metadata = {
            "event_id": event.event_id,
            "event_type": event.event_type,
            "timestamp": event.timestamp.isoformat(),
            "content": event.content,
            **event.metadata,
        }
        self.index.upsert([(
            f"episodic_{event.event_id}",
            event.embedding,
            metadata,
        )])

    def upsert_semantic(self, fact: SemanticFact):
        """Store or update a semantic fact."""
        if fact.embedding is None:
            raise ValueError("Fact must have embedding")
        metadata = {
            "fact_id": fact.fact_id,
            "content": fact.content,
            "confidence": fact.confidence,
            "created_at": fact.created_at.isoformat(),
        }
        self.index.upsert([(
            f"semantic_{fact.fact_id}",
            fact.embedding,
            metadata,
        )])

    def retrieve(self, query_embedding: List[float], top_k: int = 5,
                 event_type_filter: str = None) -> List[Dict]:
        """Retrieve top-K similar episodic or semantic memories."""
        filter_dict = None
        if event_type_filter:
            filter_dict = {"event_type": {"$eq": event_type_filter}}
        results = self.index.query(
            vector=query_embedding,
            top_k=top_k,
            include_metadata=True,
            filter=filter_dict,
        )
        return results["matches"]
Step 3: Embedding Pipeline
from openai import OpenAI

class EmbeddingEngine:
    def __init__(self, model: str = "text-embedding-3-small"):
        self.client = OpenAI()
        self.model = model

    def embed_text(self, text: str) -> List[float]:
        """Embed a single text."""
        response = self.client.embeddings.create(
            input=text,
            model=self.model,
        )
        return response.data[0].embedding

    def embed_batch(self, texts: List[str]) -> List[List[float]]:
        """Embed multiple texts in one request."""
        response = self.client.embeddings.create(
            input=texts,
            model=self.model,
        )
        # Sort by index to preserve input order
        sorted_data = sorted(response.data, key=lambda item: item.index)
        return [item.embedding for item in sorted_data]
Step 4: Consolidation Loop
from anthropic import Anthropic

class ConsolidationEngine:
    def __init__(self, llm_client, embedding_engine: EmbeddingEngine,
                 vector_store: MemoryVectorStore):
        self.llm = llm_client
        self.embedder = embedding_engine
        self.store = vector_store

    def consolidate_events(self, events: List[EpisodicEvent],
                           agent_id: str) -> SemanticFact:
        """Summarize episodic events into a semantic fact."""
        # Format events for the LLM
        event_text = "\n".join(
            f"[{e.timestamp.isoformat()}] ({e.event_type}) {e.content}"
            for e in events
        )
        # Prompt for consolidation
        prompt = f"""Consolidate these conversation events into 1–2 key facts:

{event_text}

Output a brief semantic fact (one sentence) that captures the essential information."""
        response = self.llm.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=256,
            messages=[{"role": "user", "content": prompt}],
        )
        fact_content = response.content[0].text
        # Embed the fact
        embedding = self.embedder.embed_text(fact_content)
        # Create the semantic fact
        fact = SemanticFact(
            fact_id=str(uuid.uuid4()),
            content=fact_content,
            source_event_ids=[e.event_id for e in events],
            created_at=datetime.now(),
            confidence=0.85,  # Heuristic
            embedding=embedding,
        )
        # Store
        self.store.upsert_semantic(fact)
        return fact

    def run_consolidation_loop(self, agent_id: str, events: List[EpisodicEvent],
                               batch_size: int = 5):
        """Consolidate old episodic events in batches."""
        for i in range(0, len(events), batch_size):
            batch = events[i:i + batch_size]
            self.consolidate_events(batch, agent_id)
            print(f"Consolidated batch {i // batch_size + 1}")
Step 5: Agent Memory Manager
from datetime import timedelta

class AgentMemoryManager:
    def __init__(self, agent_id: str, vector_store: MemoryVectorStore,
                 embedding_engine: EmbeddingEngine,
                 consolidation_engine: ConsolidationEngine):
        self.agent_id = agent_id
        self.store = vector_store
        self.embedder = embedding_engine
        self.consolidator = consolidation_engine
        self.episodic_buffer = []  # Local in-memory buffer

    def add_event(self, event: EpisodicEvent):
        """Add a new event to episodic memory."""
        # Embed
        event.embedding = self.embedder.embed_text(event.content)
        # Store
        self.store.upsert_episodic(event)
        self.episodic_buffer.append(event)

    def retrieve_context(self, query: str, top_k: int = 5) -> str:
        """Retrieve memory context for a given query."""
        # Embed the query
        query_embedding = self.embedder.embed_text(query)
        # Retrieve from the vector store
        results = self.store.retrieve(query_embedding, top_k=top_k)
        # Format for prompt injection
        context_lines = []
        for result in results:
            metadata = result["metadata"]
            content = metadata.get("content", "")
            timestamp = metadata.get("timestamp", "unknown")
            context_lines.append(f"[{timestamp}] {content}")
        if context_lines:
            return "## Memory Context\n\n" + "\n".join(context_lines)
        return ""

    def maintenance(self, days_old: int = 30):
        """Run periodically: consolidate old events, prune low-signal data."""
        cutoff = datetime.now() - timedelta(days=days_old)
        old_events = [e for e in self.episodic_buffer if e.timestamp < cutoff]
        if old_events:
            print(f"Consolidating {len(old_events)} events...")
            self.consolidator.run_consolidation_loop(
                self.agent_id,
                old_events,
                batch_size=5,
            )
            print("Consolidation complete.")
Step 6: Integration with Agent Loop
def agent_loop(memory_manager: AgentMemoryManager, llm_client, user_message: str):
    """Main agent loop with memory."""
    # 1. Record the user message as an episodic event
    user_event = EpisodicEvent(
        event_id=str(uuid.uuid4()),
        timestamp=datetime.now(),
        event_type="user_message",
        content=user_message,
        metadata={"user_id": "user_123"},
    )
    memory_manager.add_event(user_event)

    # 2. Retrieve relevant memory context
    memory_context = memory_manager.retrieve_context(user_message, top_k=3)

    # 3. Build the prompt
    system_prompt = """You are a helpful agent with long-term memory.
Use retrieved memory context when relevant. Always cite sources."""
    prompt = f"""{memory_context}

User: {user_message}"""

    # 4. LLM call
    response = llm_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=512,
        system=system_prompt,
        messages=[{"role": "user", "content": prompt}],
    )
    agent_response = response.content[0].text

    # 5. Record the agent response
    agent_event = EpisodicEvent(
        event_id=str(uuid.uuid4()),
        timestamp=datetime.now(),
        event_type="agent_response",
        content=agent_response,
        metadata={"turn": 1},
    )
    memory_manager.add_event(agent_event)
    return agent_response

# Usage
from anthropic import Anthropic

llm = Anthropic()
vector_store = MemoryVectorStore(index_name="agent_memory")
embedder = EmbeddingEngine()
memory_mgr = AgentMemoryManager(
    agent_id="agent_001",
    vector_store=vector_store,
    embedding_engine=embedder,
    consolidation_engine=ConsolidationEngine(llm, embedder, vector_store),
)

response = agent_loop(memory_mgr, llm, "What did we discuss about pricing earlier?")
print(response)
This code demonstrates the core flow: events → embeddings → vector retrieval → prompt injection → agent reasoning → event recording → consolidation. Real production systems add error handling, batching, async consolidation, and multi-agent coordination.
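None of that hardening is exotic. As one example, a retry-with-backoff helper (the `with_retries` name, attempt count, and delays are all illustrative, not part of the implementation above) keeps a transient vector-store failure from silently dropping an event:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(fn: Callable[[], T], attempts: int = 3, base_delay: float = 0.5) -> T:
    """Call fn, retrying with exponential backoff on failure.

    Sketch only: production code should catch transient errors
    (timeouts, 5xx responses) specifically and add jitter to the delay.
    """
    last_exc = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:  # narrow this in real code
            last_exc = exc
            time.sleep(base_delay * (2 ** attempt))
    raise last_exc

# Usage: wrap the store write so one flaky upsert doesn't lose the event.
# with_retries(lambda: memory_manager.add_event(user_event))
```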
Frequently Asked Questions
Q1. How is agent memory different from fine-tuning?
Fine-tuning bakes knowledge into the model weights. Agent memory retrieves knowledge at runtime. Trade-offs:
– Fine-tuning: Permanent, efficient (no retrieval cost), but slow to update (requires retraining).
– Runtime memory: Dynamic (updated continuously), flexible (swap stores at runtime), but higher latency and cost per query.
For agents learning from user interactions, runtime memory wins because the agent must adapt in real time.
Q2. What’s the best vector database?
No single answer; it depends on scale, latency, and infrastructure:
– Pinecone: Managed, simple API, fast to prototype. Cost scales with storage.
– Qdrant: Self-hosted, open-source, excellent recall. Operational overhead.
– Weaviate: GraphQL API, hybrid search (vector + keyword). Moderate learning curve.
– HNSW (local): In-process, zero infrastructure. Limited to single-machine scale.
– Milvus: Cloud-native, Kubernetes-friendly. Complex deployment.
For startups, Pinecone or Qdrant. For scale, Milvus or Weaviate.
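One caveat before choosing any of these: at small scale (a few thousand memories), an exact brute-force scan is often fast enough, with no index to build or operate. A pure-Python sketch (the `search` helper is illustrative, not from any library):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def search(query: list[float], memories: dict[str, list[float]], top_k: int = 3) -> list[str]:
    """Exact nearest-neighbor scan: O(n * d), nothing to tune or rebuild."""
    scored = sorted(
        ((cosine(query, emb), mem_id) for mem_id, emb in memories.items()),
        reverse=True,
    )
    return [mem_id for _, mem_id in scored[:top_k]]
```

ANN indexes like HNSW earn their complexity only once this linear scan exceeds your latency budget, typically somewhere past tens of thousands of vectors depending on dimension.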
Q3. Does memory leak across users?
Only if your isolation is poor. Best practices:
– Shard episodic/semantic stores by user or tenant.
– Add access control: agent can only retrieve memories tagged with its user_id.
– Encrypt stored embeddings (if privacy is critical).
– Audit memory access (log all retrievals).
Multi-tenant systems must design isolation in explicitly; it’s not automatic.
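The access-control point can be enforced as a defense-in-depth filter after the store query. A sketch (hypothetical `retrieve_for_user` helper, assuming results shaped like the vector-store output in the implementation guide above):

```python
def retrieve_for_user(results: list[dict], user_id: str) -> list[dict]:
    """Drop any result not tagged with the requesting tenant's user_id.

    Defense in depth: even if store-level sharding leaks, refuse to
    surface another tenant's memories. Results with no user_id tag are
    dropped rather than assumed safe.
    """
    return [
        r for r in results
        if r.get("metadata", {}).get("user_id") == user_id
    ]
```

Apply this after the store query but before prompt injection; per-tenant sharding remains the primary control, and this filter only catches leaks.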
Q4. How do you evaluate memory quality?
Measure three things:
1. Recall: Of the relevant past interactions, how many did retrieval find? (Use human annotations.)
2. Precision: Of the retrieved results, how many were actually relevant? (Again, human labels.)
3. Staleness: What fraction of retrieved facts were recently contradicted? (Compare with ground truth updates.)
Example evaluation:
# Manual labeling
test_queries = [
    ("What's the user's email?", ["event_123", "event_456"]),  # Relevant events
    ("Did we discuss budget?", ["event_789"]),
    ...
]

for query, true_events in test_queries:
    # Query the store directly: retrieve_context() returns a formatted
    # prompt string, but evaluation needs the raw results with event IDs.
    query_embedding = memory_mgr.embedder.embed_text(query)
    retrieved = memory_mgr.store.retrieve(query_embedding, top_k=5)
    predicted_event_ids = [r["metadata"]["event_id"] for r in retrieved]
    hits = set(true_events) & set(predicted_event_ids)
    recall = len(hits) / len(true_events)
    precision = len(hits) / len(predicted_event_ids) if predicted_event_ids else 0.0
    print(f"Query: {query}, Recall: {recall:.2f}, Precision: {precision:.2f}")
Q5. What about privacy? Can the agent memorize PII?
By design, yes—which is dangerous. Mitigations:
– PII Detection: Run a classifier on episodic events before storing. Flag or redact credit card numbers, SSNs, etc.
– Redaction: Replace “SSN: 123-45-6789” with “[REDACTED_SSN]” before embedding and storing.
– Encrypted Storage: Encrypt episodic/semantic stores at rest. Only decrypt during retrieval.
– Data Retention Policies: Auto-delete sensitive events after N days.
Example redaction:
import re

def redact_pii(text: str) -> str:
    """Redact common PII patterns from text before embedding and storing."""
    text = re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[REDACTED_SSN]", text)
    text = re.sub(r"\b\d{16}\b", "[REDACTED_CARD]", text)  # naive: misses spaced/hyphenated card numbers
    text = re.sub(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", "[REDACTED_EMAIL]", text)
    return text
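The retention-policy mitigation can be sketched similarly (hypothetical helper; assumes events carry a `sensitive` metadata flag and a `timestamp`):

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class StoredEvent:
    """Minimal stand-in for the EpisodicEvent used above (illustrative)."""
    timestamp: datetime
    metadata: dict = field(default_factory=dict)

def apply_retention(events: list, max_age_days: int = 30) -> list:
    """Drop sensitive events that have aged past the retention window.

    Sketch only: a production system would also delete the matching
    vectors from the store, not just filter the local buffer.
    """
    cutoff = datetime.now() - timedelta(days=max_age_days)
    return [
        e for e in events
        if not (e.metadata.get("sensitive") and e.timestamp < cutoff)
    ]
```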
Where Agent Memory Is Heading
Agent memory systems are evolving rapidly. Here’s what’s emerging in 2026–2027:
1. Multimodal Memory
Agents are integrating video, images, and audio. Consolidation will need to summarize visual scenes (“The user showed me a screenshot of the error; it was a 404 in the login form”). Vector embeddings will be multimodal (CLIP-style), bridging text and vision.
2. Graph-Based Memory
Instead of flat episodic/semantic stores, memory will be structured as knowledge graphs: nodes (entities, facts, events) and edges (relationships). Retrieval becomes graph traversal. Benefits: richer reasoning, fewer hallucinations, better composability.
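A toy version of that traversal, with the graph as an adjacency dict and retrieval as a bounded breadth-first walk (all entity and relation names illustrative):

```python
from collections import deque

def traverse(graph: dict, seed: str, max_hops: int = 2) -> list[str]:
    """Collect (subject, relation, object) facts within max_hops of seed.

    graph maps an entity to a list of (relation, target) edges.
    """
    facts, seen = [], {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth >= max_hops:
            continue
        for relation, target in graph.get(node, []):
            facts.append(f"{node} --{relation}--> {target}")
            if target not in seen:
                seen.add(target)
                frontier.append((target, depth + 1))
    return facts

# A question about "user" surfaces both the direct preference and a
# second-hop fact about the preferred tool.
kg = {
    "user": [("prefers", "Slack")],
    "Slack": [("integrates_with", "Jira")],
}
```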
3. Collaborative Filtering for Memory
Multiple agents will share a pool of semantic facts. If Agent A learns “User prefers Slack over email,” Agent B can benefit. Techniques: federated learning, privacy-preserving aggregation, fact voting (high-confidence facts are shared; low-confidence ones remain private).
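A minimal fact-voting rule might look like this sketch (thresholds illustrative): a fact enters the shared pool only when enough agents independently assert it with high confidence.

```python
def shareable_facts(observations: dict,
                    min_votes: int = 2,
                    min_confidence: float = 0.8) -> list[str]:
    """observations maps a fact string to per-agent confidence scores.

    A fact is shared if at least min_votes agents each reported it with
    confidence >= min_confidence; everything else stays private.
    """
    return [
        fact for fact, scores in observations.items()
        if sum(1 for s in scores if s >= min_confidence) >= min_votes
    ]
```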
4. Adaptive Consolidation
Instead of fixed schedules, consolidation will adapt to agent behavior. High-activity agents consolidate frequently; dormant agents never consolidate. Consolidation will prioritize facts the agent actually uses, not facts that are objectively important.
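One simple adaptive policy is to scale the consolidation interval inversely with recent event volume, clamped between a floor and a ceiling (all constants illustrative):

```python
def consolidation_interval_hours(events_last_24h: int,
                                 min_hours: float = 1.0,
                                 max_hours: float = 168.0,
                                 target_batch: int = 50) -> float:
    """Consolidate roughly once per target_batch events.

    A busy agent (500 events/day) consolidates every ~2.4 hours; a
    dormant one backs off to the weekly ceiling.
    """
    if events_last_24h <= 0:
        return max_hours
    interval = 24.0 * target_batch / events_last_24h
    return max(min_hours, min(max_hours, interval))
```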
5. Neuro-Symbolic Integration
Combining neural (vector-based) memory with symbolic (knowledge graph, logic-based) systems. An agent might use neural retrieval to find candidate memories, then symbolic reasoning to verify them. Example: “Is this fact consistent with the knowledge graph?”
6. Native LLM Integration
OpenAI, Anthropic, and others are shipping memory as first-class features. Rather than external systems, memory management will be internal: prompt caching (reducing redundant processing), native consolidation (LLM-controlled), and managed storage. This reduces developer burden but sacrifices flexibility.
References & Further Reading
Primary Research Papers
- Packer, C., Wooders, S., Lin, K., Fang, V., Patil, S. G., Stoica, I., & Gonzalez, J. E. (2023). MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560. https://arxiv.org/abs/2310.08560. Seminal work introducing virtual context management for agents.
- Malkov, Y. A., & Yashunin, D. A. (2018). Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. IEEE Transactions on PAMI. https://arxiv.org/abs/1603.09320. The HNSW algorithm powering many vector stores.
- Lewis, P., Perez, E., Piktus, A., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS. https://arxiv.org/abs/2005.11401. Foundation for RAG-based memory retrieval.
Frameworks & Tools
- Letta (formerly MemGPT): https://www.letta.com — Production-grade agent memory framework.
- LangGraph: https://docs.langchain.com/langgraph — Memory primitives in LangChain.
- Zep: https://docs.getzep.com — Multi-agent memory management.
- Pinecone: https://www.pinecone.io — Managed vector database.
- Qdrant: https://qdrant.tech — Open-source vector store.
Anthropic Resources
- Anthropic Prompt Caching: https://docs.anthropic.com/en/docs/build-a-bot/memory — Native memory features in Claude.
- Claude API Docs: https://docs.anthropic.com — Latest context window and memory announcements.
Further Reading
- “The Illustrated Transformer” (Jay Alammar): Essential for understanding embedding spaces.
- “Vector Search in AI” (Pinecone Blog): Practical guide to vector indexing.
- “Building AI Agents” (OpenAI Cookbook): Patterns for agent design.
Related Posts
Deepen your understanding with these related posts:
- Agentic RAG Architecture Patterns — How RAG enables agents to ground reasoning in external knowledge.
- GraphRAG: Knowledge Graph Retrieval-Augmented Generation Architecture — Structured retrieval using semantic graphs.
- AI Agents in the Trough of Disillusionment: Enterprise Deployment Lessons — Production challenges and failure modes.
- Multimodal AI Architecture: Vision, Language, and Audio Fusion — Extending memory beyond text.
