AI Agent Memory: Short-Term, Long-Term and Episodic Patterns

Memory is what separates a stateless chatbot from an agent that can build on what it has already seen. Without memory, every conversation starts from zero: the agent cannot recall previous interactions, user preferences, or lessons learned from past tasks. In 2026, memory architecture is one of the most actively evolving areas of AI engineering, with patterns emerging around four distinct memory types: short-term (context window), long-term (persistent vector store), episodic (structured past experience), and semantic (factual knowledge base).

This guide covers practical implementations of each memory type, including context window management, conversation summarization, vector-based memory retrieval, episodic memory stores, and production patterns for agents that need to remember across sessions.

Four Types of Agent Memory

Drawing from cognitive science, AI agent memory systems are organized into four types, each serving a different purpose. Understanding which type to use for each use case prevents both under-building (agents that forget everything) and over-engineering (storing everything when only recent context matters).

Short-term memory is the active context window, that is, the messages currently in the conversation. It has a hard limit (commonly 128K–200K tokens for current models, though some offer more) and is lost when the session ends. This is where the agent's current reasoning lives.

Long-term memory persists across sessions, stored in a database and retrieved by semantic similarity when relevant. Users' preferences, past decisions, and learned facts live here.

Episodic memory stores structured records of past tasks and their outcomes — "what I did, what happened, what I learned." This enables agents to improve over time and avoid repeating mistakes.

Semantic memory is a curated knowledge base of facts the agent should always have access to — company policies, product specs, domain knowledge. This is typically implemented as a RAG knowledge base over static or slowly-changing documents.

Short-Term Memory: Context Window Management

The context window is the agent's working memory. For multi-turn conversations, you must track the full message history and decide what to include in each API call. As conversations grow, you hit the token limit — the naive solution is to truncate old messages, but this loses critical context. Smarter approaches include sliding window, selective retention, and summarization.

from openai import OpenAI
import tiktoken

client = OpenAI()

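def count_tokens(messages: list[dict], model: str = "gpt-4o") -> int:
    """Approximate the token count for a list of chat messages."""
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # Unknown model name: fall back to the encoding gpt-4o uses
        enc = tiktoken.get_encoding("o200k_base")
    tokens = 0
    for msg in messages:
        tokens += 4  # approximate per-message formatting overhead
        tokens += len(enc.encode(msg.get("content", "") or ""))
    return tokens + 2  # approximate priming for the assistant reply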

class SlidingWindowMemory:
    """Keep most recent messages within a token budget."""

    def __init__(self, max_tokens: int = 6000, model: str = "gpt-4o"):
        self.max_tokens = max_tokens
        self.model = model
        self.messages: list[dict] = []
        self.system_prompt: str = ""

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        self._trim()

    def _trim(self):
        """Drop oldest non-system messages to stay within token budget."""
        system_msgs = [{"role": "system", "content": self.system_prompt}] if self.system_prompt else []
        # Always keep at least the newest message, even if it alone exceeds the budget
        while len(self.messages) > 1 and count_tokens(system_msgs + self.messages, self.model) > self.max_tokens:
            self.messages.pop(0)

    def get_context(self) -> list[dict]:
        result = []
        if self.system_prompt:
            result.append({"role": "system", "content": self.system_prompt})
        result.extend(self.messages)
        return result

    def chat(self, user_message: str) -> str:
        self.add("user", user_message)
        response = client.chat.completions.create(
            model=self.model,
            messages=self.get_context(),
            max_tokens=512,
        )
        reply = response.choices[0].message.content
        self.add("assistant", reply)
        return reply

# Usage
memory = SlidingWindowMemory(max_tokens=4000)
memory.system_prompt = "You are a helpful Python tutor."
print(memory.chat("What is a decorator?"))
print(memory.chat("Show me a memoization example."))
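
Selective retention, the third approach mentioned above, is not covered by the sliding window. A minimal sketch follows, assuming a hypothetical `pinned` flag that marks messages which must never be dropped and a crude characters-per-token estimate in place of tiktoken so the example runs offline. The names `Message`, `rough_tokens`, and `select_messages` are illustrative, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    content: str
    pinned: bool = False  # hypothetical flag: True means "never drop this message"


def rough_tokens(text: str) -> int:
    """Crude estimate (about 4 characters per token); swap in tiktoken for real counts."""
    return max(1, len(text) // 4)


def select_messages(history: list[Message], budget: int) -> list[Message]:
    """Keep every pinned message, then fill what is left of the budget with the newest unpinned ones."""
    keep = {i for i, m in enumerate(history) if m.pinned}
    used = sum(rough_tokens(history[i].content) for i in keep)
    # Walk backwards so the most recent unpinned messages are considered first.
    for i in range(len(history) - 1, -1, -1):
        if i in keep:
            continue
        cost = rough_tokens(history[i].content)
        if used + cost > budget:
            break  # stop at the first message that does not fit
        keep.add(i)
        used += cost
    # Re-emit in the original chronological order.
    return [history[i] for i in sorted(keep)]


history = [
    Message("user", "My deadline is Friday and the budget is $5,000.", pinned=True),
    Message("assistant", "Noted. " + "Here is some long background. " * 20),
    Message("user", "Thanks. What should I do first?"),
]
selected = select_messages(history, budget=30)
print([m.content[:20] for m in selected])
```

With a budget of 30 estimated tokens, the long assistant message is dropped while the pinned constraint and the latest question survive. How messages get pinned (a rule, a classifier, or an explicit user action) is left open here.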

Memory Summarization

Instead of dropping old messages when the context fills, summarization compresses them into a compact summary that preserves key facts. When the conversation exceeds a threshold, summarize the oldest N messages into a "memory summary" and replace them. This retains the gist of past interactions at a fraction of the token cost.

from openai import OpenAI

client = OpenAI()

class SummarizingMemory:
    """Compress old messages into rolling summaries."""

    def __init__(self, max_messages: int = 10, summary_model: str = "gpt-4o-mini"):
        self.max_messages = max_messages
        self.summary_model = summary_model
        self.system_prompt: str = ""
        self.summary: str = ""       # Rolling summary of past messages
        self.recent: list[dict] = [] # Recent messages kept verbatim

    def _summarize_oldest(self, n: int = 4):
        """Compress the oldest n messages into the rolling summary."""
        to_summarize = self.recent[:n]
        self.recent = self.recent[n:]

        conversation_text = "\n".join(
            f"{m['role'].upper()}: {m['content']}" for m in to_summarize
        )
        prompt = f"""Previous summary: {self.summary or 'None'}

New conversation to add:
{conversation_text}

Create an updated, concise summary covering all key points, decisions, and user preferences.
Be specific about names, numbers, and commitments. Maximum 200 words."""

        response = client.chat.completions.create(
            model=self.summary_model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=300,
            temperature=0,
        )
        self.summary = response.choices[0].message.content

    def add(self, role: str, content: str):
        self.recent.append({"role": role, "content": content})
        if len(self.recent) > self.max_messages:
            self._summarize_oldest(n=4)

    def get_context(self) -> list[dict]:
        messages = []
        if self.system_prompt:
            messages.append({"role": "system", "content": self.system_prompt})
        if self.summary:
            messages.append({"role": "system", "content": f"[Memory summary]: {self.summary}"})
        messages.extend(self.recent)
        return messages

    def chat(self, user_message: str) -> str:
        self.add("user", user_message)
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=self.get_context(),
            max_tokens=512,
        )
        reply = response.choices[0].message.content
        self.add("assistant", reply)
        return reply

Long-Term Vector Memory

Long-term memory persists across sessions using a vector database. When the agent learns something important — a user preference, a key fact, a decision — it stores it as an embedding. On each new turn, it retrieves the most relevant memories and includes them in the system prompt. This gives agents the ability to "remember" users across weeks or months of interactions.

from datetime import datetime, timezone
from openai import OpenAI
import numpy as np

client = OpenAI()

class VectorMemoryStore:
    """Long-term memory backed by embeddings. In production, use ChromaDB or Pinecone."""

    def __init__(self):
        self.memories: list[dict] = []
        self.embeddings: list[np.ndarray] = []

    def _embed(self, text: str) -> np.ndarray:
        response = client.embeddings.create(model="text-embedding-3-small", input=[text])
        return np.array(response.data[0].embedding)

    def store(self, content: str, metadata: dict | None = None):
        """Store a memory with its embedding."""
        embedding = self._embed(content)
        self.memories.append({
            "content": content,
            # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "metadata": metadata or {},
        })
        self.embeddings.append(embedding)
        print(f"Stored memory: {content[:60]}...")

    def retrieve(self, query: str, top_k: int = 3, threshold: float = 0.3) -> list[dict]:
        """Retrieve the most relevant memories for a query.

        Cosine thresholds are model-dependent; 0.3 is an assumed starting
        point, so tune it on your own data.
        """
        if not self.memories:
            return []
        q_vec = self._embed(query)
        emb_matrix = np.array(self.embeddings)
        scores = emb_matrix @ q_vec / (
            np.linalg.norm(emb_matrix, axis=1) * np.linalg.norm(q_vec)
        )
        # Sort on the score only; comparing the dicts on a tie would raise TypeError
        ranked = sorted(zip(scores, self.memories), key=lambda pair: pair[0], reverse=True)
        return [mem for score, mem in ranked[:top_k] if score >= threshold]

    def format_for_prompt(self, memories: list[dict]) -> str:
        if not memories:
            return ""
        lines = ["[Relevant memories from past interactions:]"]
        for m in memories:
            ts = m["timestamp"][:10]
            lines.append(f"- ({ts}) {m['content']}")
        return "\n".join(lines)

# Usage: agent that remembers user preferences
store = VectorMemoryStore()

# After a conversation, extract and store memories
store.store("User prefers Python code examples over Java", {"user_id": "u123", "type": "preference"})
store.store("User is building a FastAPI REST API for a logistics company", {"user_id": "u123", "type": "context"})
store.store("User's tech stack: Python 3.12, FastAPI, PostgreSQL, Docker", {"user_id": "u123", "type": "context"})

# On next session, retrieve relevant memories
query = "How should I structure my database models?"
memories = store.retrieve(query)
memory_prompt = store.format_for_prompt(memories)
print(memory_prompt)
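
The usage above stores a `user_id` in each memory's metadata, but `retrieve` ranks every stored memory regardless of who it belongs to. One possible way to scope retrieval to a single user is sketched below. The function name `retrieve_for_user` is hypothetical, and toy two-dimensional vectors stand in for real embeddings so the example runs without an API key.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_for_user(
    query_vec: list[float],
    embeddings: list[list[float]],
    memories: list[dict],
    user_id: str,
    top_k: int = 3,
) -> list[dict]:
    """Rank only the memories whose metadata matches user_id."""
    candidates = [
        (cosine(embeddings[i], query_vec), i)
        for i, m in enumerate(memories)
        if m["metadata"].get("user_id") == user_id
    ]
    # Highest similarity first; the index breaks ties deterministically.
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [memories[i] for _, i in candidates[:top_k]]


# Toy vectors stand in for real embeddings.
memories = [
    {"content": "Prefers Python", "metadata": {"user_id": "u123"}},
    {"content": "Prefers Java", "metadata": {"user_id": "u999"}},
    {"content": "Uses PostgreSQL", "metadata": {"user_id": "u123"}},
]
embeddings = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
hits = retrieve_for_user([1.0, 0.0], embeddings, memories, "u123", top_k=1)
print(hits[0]["content"])
```

Hosted vector databases generally expose metadata filters that do this server-side; the filter-then-rank order shown here is the part worth keeping.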

Episodic Memory Patterns

Episodic memory stores structured records of past tasks — what was attempted, what succeeded, what failed, and what was learned. Unlike long-term vector memory (which stores facts), episodic memory stores experiences with outcomes. This enables agents to learn from mistakes and refer back to previous successful strategies for similar problems.

import json
from datetime import datetime, timezone
from dataclasses import dataclass, asdict
from pathlib import Path
from openai import OpenAI

client = OpenAI()

@dataclass
class Episode:
    task: str
    approach: str
    outcome: str           # "success" | "failure" | "partial"
    result_summary: str
    lessons_learned: str
    timestamp: str = ""
    tags: list[str] | None = None

    def __post_init__(self):
        if not self.timestamp:
            # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
            self.timestamp = datetime.now(timezone.utc).isoformat()
        if self.tags is None:
            self.tags = []

class EpisodicMemory:
    """Store and retrieve structured past experiences."""

    def __init__(self, storage_path: str = "episodic_memory.json"):
        self.path = Path(storage_path)
        self.episodes: list[Episode] = []
        if self.path.exists():
            data = json.loads(self.path.read_text())
            self.episodes = [Episode(**e) for e in data]

    def record(self, episode: Episode):
        self.episodes.append(episode)
        self.path.write_text(json.dumps([asdict(e) for e in self.episodes], indent=2))

    def find_similar(self, task: str, max_results: int = 3) -> list[Episode]:
        """Find past episodes with similar tasks using LLM ranking."""
        if not self.episodes:
            return []
        episodes_text = "\n".join(
            f"{i}. Task: {e.task} | Outcome: {e.outcome} | Lesson: {e.lessons_learned}"
            for i, e in enumerate(self.episodes)
        )
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": (
                    f"Current task: {task}\n\nPast episodes:\n{episodes_text}\n\n"
                    f"Return a JSON object with the indices of up to {max_results} "
                    'most relevant past episodes, e.g. {"indices": [0, 2]}'
                ),
            }],
            # JSON mode returns an object, so the prompt asks for an object, not a bare array
            response_format={"type": "json_object"},
            temperature=0,
        )
        try:
            indices = json.loads(response.choices[0].message.content).get("indices", [])
        except (json.JSONDecodeError, AttributeError, TypeError):
            return []
        if not isinstance(indices, list):
            return []
        # Ignore anything that is not a valid index into self.episodes
        valid = [i for i in indices if isinstance(i, int) and 0 <= i < len(self.episodes)]
        return [self.episodes[i] for i in valid[:max_results]]

    def format_episodes(self, episodes: list[Episode]) -> str:
        if not episodes:
            return ""
        lines = ["[Relevant past experiences:]"]
        for ep in episodes:
            lines.append(f"Task: {ep.task}")
            lines.append(f"  Approach: {ep.approach}")
            lines.append(f"  Outcome: {ep.outcome} — {ep.result_summary}")
            lines.append(f"  Lesson: {ep.lessons_learned}")
        return "\n".join(lines)

# Usage
em = EpisodicMemory()
em.record(Episode(
    task="Optimize slow PostgreSQL query on orders table",
    approach="Added composite index on (user_id, created_at), rewrote subquery as JOIN",
    outcome="success",
    result_summary="Query time reduced from 4.2s to 0.08s",
    lessons_learned="Always check for missing indexes before rewriting queries",
    tags=["postgresql", "performance"]
))
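
`find_similar` sends every stored episode to the model, which gets expensive as the store grows. One option, sketched here as an assumption and not as part of the class above, is to narrow the candidates by tag overlap first and pass only those to LLM ranking. `TaggedEpisode` is a hypothetical trimmed stand-in for `Episode` so the block runs on its own, and `prefilter_by_tags` is an illustrative name.

```python
from dataclasses import dataclass, field


@dataclass
class TaggedEpisode:  # hypothetical trimmed stand-in for Episode
    task: str
    outcome: str
    lessons_learned: str
    tags: list[str] = field(default_factory=list)


def prefilter_by_tags(
    episodes: list[TaggedEpisode], query_tags: list[str], max_results: int = 3
) -> list[TaggedEpisode]:
    """Return episodes sharing at least one tag with the query, most overlap first."""
    wanted = {t.lower() for t in query_tags}
    scored = []
    for ep in episodes:
        overlap = len(wanted & {t.lower() for t in ep.tags})
        if overlap:
            scored.append((overlap, ep))
    # Sort on the overlap count only; the sort is stable, so ties keep insertion order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ep for _, ep in scored[:max_results]]


episodes = [
    TaggedEpisode("Optimize slow PostgreSQL query", "success",
                  "Check for missing indexes first", ["postgresql", "performance"]),
    TaggedEpisode("Fix flaky Docker build", "failure",
                  "Pin base image versions", ["docker"]),
    TaggedEpisode("Tune PostgreSQL connection pool", "partial",
                  "Measure before changing pool size", ["postgresql"]),
]
matches = prefilter_by_tags(episodes, ["PostgreSQL", "performance"])
for ep in matches:
    print(f"{ep.outcome}: {ep.task}")
```

Tag matching is coarse and depends on consistent tagging at record time, so it works best as a cheap first pass before a semantic or LLM-based ranking step.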

Redis-Backed Conversation Memory

Redis is a common backend for conversation memory in production: reads and writes are fast and in-memory, keys can expire automatically via a TTL, and structured history can be stored as serialized JSON. Store conversation history keyed by session ID, with a configurable TTL so old sessions are automatically evicted.

import json
import redis
from openai import OpenAI

client = OpenAI()
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

SESSION_TTL = 60 * 60 * 24  # 24 hours

def get_history(session_id: str) -> list[dict]:
    data = r.get(f"session:{session_id}")
    return json.loads(data) if data else []

def save_history(session_id: str, messages: list[dict]):
    r.setex(f"session:{session_id}", SESSION_TTL, json.dumps(messages))

def chat(session_id: str, user_message: str, system_prompt: str = "") -> str:
    history = get_history(session_id)
    history.append({"role": "user", "content": user_message})

    # Send only the last 20 messages to limit prompt size (the full history is still saved)
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.extend(history[-20:])

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        max_tokens=512,
    )
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    save_history(session_id, history)
    return reply

# Usage: session persists across requests
session = "user-123-session-456"
print(chat(session, "My name is Alice and I'm building a FastAPI app."))
print(chat(session, "What was my name again?"))  # Should remember "Alice"
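
The version above reads, modifies, and rewrites the whole history on every turn, so the stored value grows without bound and two concurrent requests can overwrite each other. A sketch of one alternative uses a Redis list with `RPUSH`, `LTRIM`, and `EXPIRE` in a pipeline. The key layout and the `MAX_MESSAGES` cap are assumptions, and `InMemoryRedis` is a hypothetical stand-in that implements only the calls used here so the example runs without a server; in real use you would pass a `redis.Redis(..., decode_responses=True)` client instead.

```python
import json
from typing import Any

MAX_MESSAGES = 20            # assumed cap on stored messages per session
SESSION_TTL = 60 * 60 * 24   # 24 hours


def append_message(r: Any, session_id: str, role: str, content: str) -> None:
    """Append one message, trim to the newest MAX_MESSAGES, and refresh the TTL."""
    key = f"session:{session_id}:messages"
    pipe = r.pipeline()
    pipe.rpush(key, json.dumps({"role": role, "content": content}))
    pipe.ltrim(key, -MAX_MESSAGES, -1)
    pipe.expire(key, SESSION_TTL)
    pipe.execute()


def load_messages(r: Any, session_id: str) -> list[dict]:
    """Load the stored messages in chronological order."""
    return [json.loads(item) for item in r.lrange(f"session:{session_id}:messages", 0, -1)]


class InMemoryRedis:
    """Hypothetical stand-in for a Redis client; TTLs are not simulated."""

    def __init__(self):
        self.lists: dict[str, list[str]] = {}

    def pipeline(self):
        return self  # a real client returns a Pipeline that buffers commands

    def rpush(self, key, value):
        self.lists.setdefault(key, []).append(value)

    def ltrim(self, key, start, end):
        items = self.lists.get(key, [])
        self.lists[key] = items[start:] if end == -1 else items[start:end + 1]

    def expire(self, key, seconds):
        pass

    def execute(self):
        pass

    def lrange(self, key, start, end):
        items = self.lists.get(key, [])
        return items[start:] if end == -1 else items[start:end + 1]


fake = InMemoryRedis()
for n in range(25):
    append_message(fake, "demo", "user", f"message {n}")
loaded = load_messages(fake, "demo")
print(len(loaded), loaded[0]["content"])
```

After 25 appends only the newest 20 messages remain, so the prompt size and the stored value are both bounded by the same cap.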

Semantic Memory and Knowledge Bases

Semantic memory provides the agent with stable, factual knowledge such as product documentation, company policies, and domain expertise. Unlike episodic memory (which changes with experience), semantic memory is curated and updated deliberately. It is typically implemented as a RAG pipeline over a vector database, with the retrieved passages injected into the system prompt on each relevant query.

from openai import OpenAI
import numpy as np

client = OpenAI()

KNOWLEDGE_BASE = [
    "Our refund policy: full refund within 30 days of purchase, no questions asked.",
    "Shipping takes 3-5 business days for standard, 1-2 days for express.",
    "Customer support hours: Monday to Friday, 9 AM to 6 PM IST.",
    "Products come with a 1-year manufacturer warranty.",
    "Bulk orders of 10+ units qualify for 15% discount with code BULK15.",
]

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of texts in one API call; returns one row per text."""
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

# Embed the whole knowledge base once at startup
kb_embeddings = embed(KNOWLEDGE_BASE)

def agent_with_knowledge(user_query: str) -> str:
    q_vec = embed([user_query])[0]
    scores = kb_embeddings @ q_vec / (np.linalg.norm(kb_embeddings, axis=1) * np.linalg.norm(q_vec))
    top3 = np.argsort(scores)[::-1][:3]
    context = "\n".join(f"- {KNOWLEDGE_BASE[i]}" for i in top3 if scores[i] > 0.5)

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"You are a customer support agent.\n\nRelevant knowledge:\n{context}"},
            {"role": "user", "content": user_query}
        ],
        max_tokens=256, temperature=0,
    )
    return response.choices[0].message.content

Production Memory Architecture

A production agent memory system combines all four memory types in a layered architecture. The system prompt always contains: the agent's persona and instructions, relevant semantic knowledge (RAG), retrieved long-term memories, and a rolling summary of past sessions. The active message history holds recent turns. Episodic memory informs the agent's reasoning strategy before it begins a task.
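
The layering described above can be sketched as a single function that assembles the system prompt from whichever layers are available and then appends the recent turns. The function name `build_messages`, its parameters, and the section labels are all illustrative assumptions; the inputs would come from the retrieval and summarization components shown earlier.

```python
def build_messages(
    persona: str,
    knowledge: list[str],
    memories: list[str],
    summary: str,
    recent: list[dict],
) -> list[dict]:
    """Combine persona, semantic knowledge, long-term memories, and summary into one system message."""
    sections = [persona]
    if knowledge:
        sections.append("Relevant knowledge:\n" + "\n".join(f"- {k}" for k in knowledge))
    if memories:
        sections.append("Relevant memories:\n" + "\n".join(f"- {m}" for m in memories))
    if summary:
        sections.append("Summary of past sessions:\n" + summary)
    system = {"role": "system", "content": "\n\n".join(sections)}
    # Recent turns stay verbatim after the system message.
    return [system, *recent]


messages = build_messages(
    persona="You are a customer support agent.",
    knowledge=["Full refund within 30 days of purchase."],
    memories=["User prefers email over phone."],
    summary="",  # empty layers are simply left out
    recent=[{"role": "user", "content": "Can I still return my order?"}],
)
print(messages[0]["content"])
```

Keeping the assembly in one place makes it easier to cap each layer's size separately, which matters once several retrieval sources compete for the same context window.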

Key operational concerns: memory TTLs must be set carefully (too short and the agent forgets users; too long and stale memories mislead it). Memory extraction should be selective — not every message deserves to be stored long-term. Use a lightweight LLM (GPT-4o-mini or Claude Haiku) to classify and extract memorable facts from conversation turns without adding significant latency.

Memory extraction prompt: after each assistant turn, run the prompt below and store only the items with importance >= 3: "From this conversation turn, extract any facts worth remembering long-term about the user (preferences, context, decisions). If nothing is memorable, return an empty JSON array. Return: [{content, type, importance: 1-5}]"
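
The filtering step on the extraction output can be sketched as follows, assuming the model replies with the JSON array the prompt asks for. The function name `filter_extracted` is hypothetical, and the defensive checks reflect the assumption that model output may be malformed or missing fields.

```python
import json


def filter_extracted(raw: str, min_importance: int = 3) -> list[dict]:
    """Parse the extraction reply and keep only well-formed items at or above min_importance."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []  # malformed reply: store nothing
    if not isinstance(items, list):
        return []
    kept = []
    for item in items:
        if not isinstance(item, dict):
            continue
        content = item.get("content")
        importance = item.get("importance")
        if (
            isinstance(content, str)
            and content.strip()
            and isinstance(importance, int)
            and importance >= min_importance
        ):
            kept.append(item)
    return kept


raw_reply = json.dumps([
    {"content": "User is migrating to PostgreSQL 16", "type": "context", "importance": 4},
    {"content": "User said thanks", "type": "context", "importance": 1},
])
to_store = filter_extracted(raw_reply)
print([item["content"] for item in to_store])
```

Each surviving item's `content` can then be passed to the long-term store, with `type` and `importance` kept as metadata.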