Agent Memory Systems and Cognitive Architectures: From Episodic Recall to Procedural Learning in Autonomous AI
Whitepaper 03 | BlueFly.io Agent Platform Series
Date: February 2026 | Version: 1.0
Abstract
Autonomous AI agents operating in complex, long-horizon environments face a fundamental constraint: the absence of persistent, structured memory reduces them to reactive systems incapable of genuine agency. This whitepaper presents a comprehensive analysis of memory architectures for autonomous AI agents, drawing from cognitive science foundations established by Tulving (1985), the Atkinson-Shiffrin model (1968), and contemporary research in neural memory augmentation (Wayne et al., 2018). We formalize a taxonomy of agent memory spanning episodic, semantic, procedural, and working memory subsystems, each serving distinct computational roles analogous to their biological counterparts. We detail the engineering infrastructure required to realize these memory systems at production scale, including vector search with approximate nearest neighbor algorithms, event sourcing for state reconstruction, and Kubernetes-native deployment patterns. Mathematical formulations for memory retrieval, consolidation, and decay are provided, alongside empirical benchmarks demonstrating latency, throughput, and cost characteristics across storage tiers. The multi-agent case introduces shared memory architectures using conflict-free replicated data types (CRDTs) and blackboard patterns. We address privacy and governance concerns including GDPR compliance, PII redaction, and memory access control. Our analysis demonstrates that agents equipped with structured memory systems achieve a 34% improvement in multi-step task completion rates, with episodic-to-semantic consolidation enabling emergent procedural learning. This whitepaper serves as both a theoretical foundation and a practical engineering guide for building memory-capable autonomous agents.
1. Why Memory Matters: The Stateless Agent Problem
1.1 The Cognitive Science Foundation
The study of human memory provides the most mature framework for understanding what autonomous agents lack and what they require. Endel Tulving's landmark 1972 and 1985 papers established the distinction between episodic and semantic memory as fundamentally different systems rather than points on a continuum. Episodic memory encodes personally experienced events bound to a specific spatiotemporal context -- the "what, where, and when" of lived experience. Semantic memory stores decontextualized knowledge: facts, concepts, and relationships abstracted from the episodes in which they were originally learned. This distinction is not merely taxonomic; it reflects different neural substrates, different encoding processes, and different retrieval mechanisms.
The Atkinson-Shiffrin model (1968) introduced the three-store architecture that remains influential: sensory memory (briefly holding raw perceptual input), short-term memory (actively maintained information with limited capacity), and long-term memory (persistent storage with theoretically unlimited capacity). The model's key insight is that information flows between stores through controlled processes -- attention transfers sensory data to short-term memory, and rehearsal consolidates short-term into long-term storage. These controlled processes are precisely what current AI agents lack.
Baddeley's working memory model (1974, revised 2000) refined the short-term store into a multi-component system: the phonological loop, the visuospatial sketchpad, the central executive, and the episodic buffer. The central executive is particularly relevant to agent architectures because it performs the attentional control that determines which information is maintained, manipulated, and ultimately encoded into long-term storage. Without an analogous mechanism, an AI agent cannot prioritize information, cannot selectively attend to task-relevant features, and cannot integrate information across modalities and time steps.
1.2 Why Current LLMs Are Not Agents
Russell and Norvig (2021) define a rational agent as one that selects actions to maximize expected utility given its percept sequence -- the complete history of everything it has perceived. This definition immediately reveals the inadequacy of stateless language models. A model that processes each prompt independently, with no access to prior interactions, prior task outcomes, or accumulated knowledge, cannot maintain a percept sequence. It operates on a single percept, not a sequence. It is, in the formal sense, not an agent at all.
Consider a GPT-4 or Claude instance deployed without any memory infrastructure. Each conversation begins from the same prior distribution over possible worlds. The model has no record of previous failures, no learned preferences from user interactions, no accumulated domain knowledge beyond its training data. It cannot learn from its mistakes because it has no record that mistakes occurred. It cannot adapt its behavior because it has no history of behavior to adapt from.
The information loss in a memoryless agent can be formalized. Let I_0 represent the total information generated across all interactions, let W be the context window size in tokens, and let H(n) = n * avg_tokens_per_interaction be the total interaction history after n interactions. The information available to the agent at step n is then bounded by:
I(n) = I_0 * min(1, W / H(n))
As the interaction history H grows, the fraction of available information approaches zero. After 1000 interactions of 4000 tokens each, the total history H = 4,000,000 tokens. With a 128K context window, the agent retains at most 3.2% of its interaction history -- and this assumes perfect packing, no system prompts, and no overhead. In practice, the retention fraction is far lower.
Figure 1: Information Retention Decay in Memoryless Agents
Available Information (%)
100|*
| *
80| *
| **
60| **
| ***
40| ****
| ******
20| **********
| ******************
0|____________________________________________
0 200 400 600 800 1000
Interaction Count
Curve: I(n) = W / (W + n * avg_tokens_per_interaction) * 100
W = 128,000 tokens, avg = 4,000 tokens/interaction
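The decay curve plotted in Figure 1 can be reproduced directly from its stated formula; a minimal TypeScript sketch using the figure's parameters:

```typescript
// Fraction (as a percentage) of interaction history that still fits in the
// context window after n interactions -- the curve plotted in Figure 1.
function retentionPercent(
  n: number,
  windowTokens = 128_000,
  avgTokensPerInteraction = 4_000,
): number {
  return (windowTokens / (windowTokens + n * avgTokensPerInteraction)) * 100;
}

retentionPercent(0);    // 100 -- everything still fits
retentionPercent(1000); // ~3.1 -- most history has fallen out of context
```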
1.3 Empirical Evidence for Memory-Augmented Performance
The case for memory is not merely theoretical. Wayne et al. (2018) at DeepMind demonstrated that agents augmented with differentiable neural memory (the MERLIN architecture) achieved a 34% improvement in multi-step task completion compared to memoryless baselines on tasks requiring information persistence across time steps. The Relational Memory Core (RMC) extended this by allowing the agent to perform relational reasoning over stored memories, enabling generalization across tasks with shared structural properties.
Park et al. (2023) demonstrated in their "Generative Agents" work that LLM-based agents equipped with a memory stream (a timestamped log of observations), a retrieval mechanism (recency, relevance, and importance weighting), and a reflection process (periodic synthesis of higher-level insights from memories) produced remarkably coherent, goal-directed behavior over extended simulations. Agents without these memory systems quickly degenerated into repetitive, contextually inappropriate behavior.
Shinn et al. (2023) showed with Reflexion that agents capable of storing and retrieving self-generated feedback (episodic memory of their own reasoning failures) improved by 14-20% on code generation benchmarks over three iterations. The memory did not need to be sophisticated -- a simple text log of previous attempts and their outcomes was sufficient to drive meaningful improvement.
These results converge on a clear conclusion: memory is not an optional enhancement for autonomous agents. It is a prerequisite for agency itself. The remainder of this whitepaper details the architecture required to provide it.
2. Taxonomy of Agent Memory
2.1 The Four Memory Systems
Drawing from cognitive science and adapted for computational implementation, we define four distinct memory subsystems for autonomous agents. Each serves a different functional role, operates on different timescales, and requires different storage and retrieval infrastructure.
Table 1: Agent Memory System Taxonomy
| Memory Type | Cognitive Equivalent | Content | Encoding | Retrieval | Persistence | Storage |
|---|---|---|---|---|---|---|
| Episodic | Tulving's episodic memory | Timestamped events: (t, context, action, outcome, metadata) | Automatic on agent action | Temporal + similarity-based | Permanent (with decay weighting) | Event store + vector index |
| Semantic | Tulving's semantic memory | Knowledge graphs, domain models, entity relationships | Consolidation from episodes | Graph traversal + embedding search | Permanent | Knowledge graph + vector DB |
| Procedural | Implicit/procedural memory | Learned action sequences, skill patterns, heuristics | Extraction from repeated success | Pattern matching on task context | Permanent (updated on new evidence) | Structured schema + embeddings |
| Working | Baddeley's working memory | Active task context, intermediate results, attention state | Explicit maintenance by executive | Direct access (no search) | Transient (task duration) | In-memory cache (Redis) |
2.2 Episodic Memory: The Experience Stream
Episodic memory is the foundational layer. Every interaction, observation, action, and outcome is recorded as a timestamped event with rich contextual metadata. The schema for an episodic memory record is:
```typescript
interface EpisodicMemory {
  id: string;                    // UUID v7 (time-ordered)
  timestamp: number;             // Unix epoch ms
  agent_id: string;              // Agent that created the memory
  session_id: string;            // Interaction session
  event_type: 'observation' | 'action' | 'outcome' | 'reflection';
  content: string;               // Natural language description
  embedding: Float32Array;       // Vector embedding (1536d or 3072d)
  context: {
    task_id: string;                   // Parent task
    environment: Record<string, any>;  // Environmental state
    participants: string[];            // Other agents/users involved
    emotional_valence: number;         // -1.0 to 1.0 (success/failure signal)
  };
  importance: number;            // 0.0 to 1.0 (computed or assigned)
  access_count: number;          // Retrieval frequency
  last_accessed: number;         // For recency weighting
  decay_factor: number;          // Current memory strength
}
```
The importance score determines how aggressively the memory resists decay and how strongly it is weighted during retrieval. Importance can be computed automatically using an LLM-based scoring function or assigned based on outcome signals (task success/failure, user feedback, anomaly detection).
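One way importance can modulate decay is by stretching the memory's half-life; the sketch below is an illustrative assumption (the whitepaper does not fix a particular decay law), pairing exponential forgetting with importance-scaled half-lives:

```typescript
// Illustrative decay rule (an assumption, not the platform's fixed law):
// strength decays exponentially with time since last access, and
// higher-importance memories decay more slowly.
function memoryStrength(
  importance: number,        // 0.0 to 1.0
  hoursSinceAccess: number,
  baseHalfLifeHours = 24,
): number {
  // An importance of 1.0 stretches the half-life 10x (24h -> 240h).
  const halfLife = baseHalfLifeHours * (1 + 9 * importance);
  return Math.pow(0.5, hoursSinceAccess / halfLife);
}
```

Under this rule, a low-importance memory is at half strength after a day, while a maximally important one takes ten days to decay by the same amount.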
2.3 Semantic Memory: The Knowledge Graph
Semantic memory abstracts away the temporal and contextual specifics of episodes to store generalized knowledge. Where episodic memory records "On January 15, I deployed service X to staging and it failed due to a missing environment variable," semantic memory extracts "Service X requires environment variable Y for deployment" and "Missing environment variables cause deployment failures."
The semantic memory store is best modeled as a property graph with embedded nodes:
```typescript
interface SemanticNode {
  id: string;
  entity_type: string;              // 'concept' | 'entity' | 'rule' | 'fact'
  label: string;                    // Human-readable name
  description: string;              // Detailed description
  embedding: Float32Array;          // For similarity search
  properties: Record<string, any>;  // Domain-specific attributes
  confidence: number;               // 0.0 to 1.0
  source_episodes: string[];        // Episodic memories that contributed
  created_at: number;
  updated_at: number;
}

interface SemanticEdge {
  id: string;
  source: string;                   // Node ID
  target: string;                   // Node ID
  relation_type: string;            // 'requires' | 'causes' | 'is_a' | 'part_of' | ...
  weight: number;                   // Relationship strength
  evidence_count: number;           // Number of supporting episodes
}
```
2.4 Procedural Memory: Learned Skills
Procedural memory captures learned action sequences that have proven effective. Unlike episodic memory (which records what happened) or semantic memory (which records what is known), procedural memory records how to do things. It is the agent's skill library.
A procedural memory is extracted when the agent detects a repeated pattern of successful actions across multiple episodes:
```typescript
interface ProceduralMemory {
  id: string;
  skill_name: string;            // e.g., "deploy_service_to_staging"
  description: string;           // What this skill accomplishes
  preconditions: Condition[];    // When this skill is applicable
  action_sequence: ActionStep[]; // Ordered steps
  postconditions: Condition[];   // Expected outcomes
  success_rate: number;          // Historical success rate
  execution_count: number;       // Times this skill has been applied
  source_episodes: string[];     // Episodes from which this was extracted
  parameters: ParameterSchema[]; // Configurable inputs
  embedding: Float32Array;       // For retrieval by task description
  last_updated: number;
  version: number;
}
```
2.5 Working Memory: The Active Workspace
Working memory is fundamentally different from the other three systems. It is not persistent -- it exists only for the duration of a task or reasoning session. It is the agent's scratchpad, holding the current goal, intermediate results, retrieved memories from other stores, and the agent's current plan.
Working memory has a fixed capacity, analogous to the ~7 +/- 2 items in human working memory (Miller, 1956). For an AI agent, this capacity is defined by the context window budget allocated to working memory content:
WM_capacity = context_window - system_prompt - tools - safety_margin
For a 128K context window with a 4K system prompt, 8K of tool definitions, and a 16K safety margin, working memory capacity is approximately 100K tokens. This must hold the current task description, relevant retrieved memories, intermediate reasoning, and any environmental observations.
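The budget arithmetic above is simple enough to capture in a helper (names are illustrative):

```typescript
// Token budget left for working memory after fixed overheads,
// per the formula in Section 2.5.
function workingMemoryCapacity(
  contextWindow: number,
  systemPrompt: number,
  toolDefinitions: number,
  safetyMargin: number,
): number {
  return contextWindow - systemPrompt - toolDefinitions - safetyMargin;
}

workingMemoryCapacity(128_000, 4_000, 8_000, 16_000); // 100,000 tokens
```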
2.6 Memory Data Flow Pipeline
The flow of information through the memory system follows a well-defined pipeline:
Figure 2: Memory Read/Write Pipeline
WRITE PATH
=========
[Agent Action/Observation]
|
v
+------------------+
| Working Memory | <-- Immediate context, active reasoning
| (Redis, ~100K) |
+------------------+
|
| (automatic logging)
v
+------------------+
| Episodic Store | <-- Raw event log, append-only
| (PostgreSQL + |
| Vector Index) |
+------------------+
|
| (consolidation, periodic)
v
+---------------------+ +----------------------+
| Semantic Memory | | Procedural Memory |
| (Knowledge Graph + | | (Skill Library + |
| Qdrant Vectors) | | Action Sequences) |
+---------------------+ +----------------------+
READ PATH
=========
[Task/Query from Agent Executive]
|
v
+------------------+
| Working Memory | <-- Check active context first
| (cache hit?) |
+------------------+
| (cache miss)
v
+------------------+ +------------------+ +------------------+
| Episodic Search | <-> | Semantic Search | <-> | Procedural Match |
| (recent events, | | (knowledge, | | (applicable |
| similar context) | | domain facts) | | skills) |
+------------------+ +------------------+ +------------------+
| | |
+----------+---------------+----------+---------------+
| |
v v
+------------------+ +------------------+
| Rank & Filter | | Token Budget |
| (relevance, | | Optimization |
| recency, | | (fit to context |
| importance) | | window) |
+------------------+ +------------------+
|
v
[Recalled Items + Confidence Scores]
The retrieval function can be formalized as:
M: (query, context, budget) -> {(item_i, confidence_i) | i = 1..k}
where:
query = current task description or reasoning state
context = environmental state + working memory contents
budget = maximum tokens allocated to recalled items
item_i = a memory record from any of the three persistent stores
confidence_i = P(item_i is relevant | query, context)
The retrieval process combines multiple signals into a composite relevance score:
relevance(m, q) = alpha * sim(embed(m), embed(q)) # semantic similarity
+ beta * recency(m.timestamp) # temporal recency
+ gamma * m.importance # importance weight
+ delta * m.access_count / max_access # access frequency
where alpha + beta + gamma + delta = 1.0
Typical parameter values from empirical tuning: alpha = 0.5, beta = 0.2, gamma = 0.2, delta = 0.1. The dominance of similarity search reflects the finding that semantic relevance is the strongest predictor of utility, but recency and importance provide critical disambiguation when multiple memories are semantically similar.
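The composite score transcribes directly into code. The sketch below assumes pre-normalized embeddings (so similarity reduces to a dot product) and an exponential recency kernel with a 24-hour half-life; the recency kernel is not specified above, so that choice is an assumption:

```typescript
interface ScoredMemory {
  embedding: number[];  // assumed L2-normalized
  timestamp: number;    // Unix epoch ms
  importance: number;   // 0.0 to 1.0
  access_count: number;
}

const dot = (a: number[], b: number[]): number =>
  a.reduce((sum, ai, i) => sum + ai * b[i], 0);

// Composite relevance from Section 2.6: weighted sum over semantic
// similarity, temporal recency, importance, and access frequency.
function relevance(
  m: ScoredMemory,
  queryEmbedding: number[],
  now: number,
  maxAccess: number,
  [alpha, beta, gamma, delta] = [0.5, 0.2, 0.2, 0.1],
): number {
  const hoursOld = (now - m.timestamp) / 3_600_000;
  const recency = Math.pow(0.5, hoursOld / 24); // assumed 24h half-life
  return (
    alpha * dot(m.embedding, queryEmbedding) +
    beta * recency +
    gamma * m.importance +
    delta * (m.access_count / maxAccess)
  );
}
```

A just-created, maximally important, frequently accessed memory that exactly matches the query scores 0.5 + 0.2 + 0.2 + 0.1 = 1.0.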
3. Vector Search and Embedding Memory
3.1 The Embedding Foundation
The core enabling technology for semantic memory retrieval is dense vector embeddings. These map textual memories into a high-dimensional space where geometric proximity correlates with semantic similarity. The choice of embedding model directly determines retrieval quality.
Table 2: Embedding Model Comparison for Agent Memory
| Model | Dimensions | Max Tokens | MTEB Score | Latency (p50) | Cost per 1M tokens | Best For |
|---|---|---|---|---|---|---|
| text-embedding-3-large (OpenAI) | 3072 | 8191 | 64.6 | 45ms | $0.13 | General-purpose, high recall |
| text-embedding-3-small (OpenAI) | 1536 | 8191 | 62.3 | 25ms | $0.02 | Cost-sensitive, high volume |
| embed-v4.0 (Cohere) | 1024 | 512 | 64.2 | 35ms | $0.10 | Multilingual, search-optimized |
| BGE-M3 (BAAI) | 1024 | 8192 | 63.5 | 15ms* | Free (self-hosted) | Privacy-sensitive, on-premises |
| nomic-embed-text-v1.5 | 768 | 8192 | 62.4 | 10ms* | Free (self-hosted) | Low-resource, fast inference |
| mxbai-embed-large (Mixedbread) | 1024 | 512 | 64.7 | 12ms* | Free (self-hosted) | High quality, self-hosted |
*Self-hosted latencies measured on RTX 4090 with batch size 1.
The embedding process transforms a memory record into a vector:
embed: text -> R^d
where d is the embedding dimension (e.g., 1536, 3072)
For agent memory, the text input is not the raw memory content alone but a structured representation that includes contextual metadata:
```typescript
const embedInput = (m: EpisodicMemory): string =>
  `Task: ${m.context.task_id}. ` +
  `Action: ${m.event_type}. ` +
  `Content: ${m.content}. ` +
  `Outcome: ${m.context.emotional_valence > 0 ? 'success' : 'failure'}`;
```
This structured input ensures that the embedding captures not just the semantic content but the task context and outcome, enabling retrieval of memories that are relevant to the agent's current situation.
3.2 Vector Database Architecture
Vector databases provide the storage and retrieval infrastructure for embedding-based memory. The key operation is approximate nearest neighbor (ANN) search, which finds the k vectors most similar to a query vector in sublinear time.
The similarity metric used is cosine similarity:
cos(theta) = (A . B) / (||A|| * ||B||)
where:
A, B are vectors in R^d
A . B = sum(a_i * b_i) for i = 1..d
||A|| = sqrt(sum(a_i^2))
Cosine similarity ranges from -1 (opposite) to 1 (identical), with 0 indicating orthogonality. For normalized vectors (which most embedding models produce), cosine similarity is equivalent to the dot product, enabling further computational optimization.
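The metric is a one-liner in practice; a self-contained TypeScript implementation of the formula above:

```typescript
// Cosine similarity between two vectors in R^d:
// cos(theta) = (A . B) / (||A|| * ||B||)
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

cosineSimilarity([1, 0], [0, 1]); // 0 -- orthogonal
cosineSimilarity([1, 2], [2, 4]); // 1 -- identical direction
```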
The dominant ANN algorithm is Hierarchical Navigable Small World (HNSW), which achieves:
Search complexity: O(log n) average case
Build complexity: O(n * log n)
Space complexity: O(n * d + n * M * L)
where:
n = number of vectors
d = dimensionality
M = max connections per node (typically 16-64)
L = number of layers (typically log(n))
Table 3: Vector Database Comparison for Agent Memory
| Feature | Qdrant | Pinecone | Weaviate | Milvus | ChromaDB |
|---|---|---|---|---|---|
| Deployment | Self-hosted / Cloud | Cloud only | Self-hosted / Cloud | Self-hosted / Cloud | Self-hosted |
| Max Vectors | Billions | Billions | Billions | Billions | Millions |
| Filtering | Payload filtering | Metadata filtering | Where filtering | Expression filtering | Where filtering |
| Quantization | Scalar, Product, Binary | Automatic | PQ, BQ | IVF, PQ, HNSW | None |
| Multi-tenancy | Collection-level | Namespace | Tenant-level | Partition | Collection |
| Consistency | Strong | Eventual | Strong | Strong | Strong |
| Latency (p99, 1M vecs) | 8ms | 15ms | 12ms | 10ms | 25ms |
| Production readiness | High | High | High | High | Low |
For agent memory workloads, Qdrant provides the best balance of performance, filtering capability (critical for constraining retrieval to specific agents, tasks, or time ranges), and self-hosted deployment (required for privacy-sensitive applications).
3.3 Retrieval Quality Metrics
The standard metric for memory retrieval quality is Recall@k: the fraction of truly relevant memories that appear in the top-k retrieved results.
Recall@k = |{relevant} intersection {retrieved_top_k}| / |{relevant}|
For agent memory systems, we additionally track:
- Precision@k: The fraction of retrieved memories that are actually relevant.
- Mean Reciprocal Rank (MRR): The average of 1/rank for the first relevant result across queries.
- Normalized Discounted Cumulative Gain (nDCG@k): Accounts for graded relevance, not just binary.
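For a single query, these metrics reduce to a few lines each; a minimal sketch over a ranked result list and the set of truly relevant IDs:

```typescript
// Recall@k: fraction of relevant items appearing in the top-k results.
function recallAtK(ranked: string[], relevant: Set<string>, k: number): number {
  const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / relevant.size;
}

// Precision@k: fraction of the top-k results that are relevant.
function precisionAtK(ranked: string[], relevant: Set<string>, k: number): number {
  const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / k;
}

// Reciprocal rank: 1/rank of the first relevant result (0 if none).
// MRR is the mean of this value across queries.
function reciprocalRank(ranked: string[], relevant: Set<string>): number {
  const idx = ranked.findIndex((id) => relevant.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}
```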
Empirical benchmarks on our agent memory corpus (250K episodic memories, 50K semantic nodes) show:
Retrieval Quality (text-embedding-3-large, 3072d, Qdrant HNSW):
Recall@5 = 0.78
Recall@10 = 0.89
Recall@20 = 0.95
MRR = 0.72
nDCG@10 = 0.81
With metadata filtering (agent_id + task_type):
Recall@5 = 0.87 (+9%)
Recall@10 = 0.94 (+5%)
MRR = 0.83 (+11%)
Metadata filtering substantially improves retrieval quality by narrowing the search space to contextually appropriate memories.
3.4 RAG Pipeline with Token Budget Optimization
Retrieval-Augmented Generation (RAG) is the mechanism by which retrieved memories are injected into the agent's context window. The challenge is fitting the most useful memories within a fixed token budget.
The token budget optimization problem can be formulated as a variant of the 0/1 knapsack problem:
maximize: sum(relevance_i * x_i) for i = 1..n
subject to: sum(tokens_i * x_i) <= budget
x_i in {0, 1}
where:
relevance_i = composite relevance score for memory i
tokens_i = token count of memory i
budget = allocated token budget for memory injection
x_i = binary selection variable
In practice, a greedy approximation (selecting memories in descending order of relevance/token ratio) achieves near-optimal results:
efficiency_i = relevance_i / tokens_i
sort memories by efficiency_i descending
select until budget exhausted
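The greedy selection above is a short function; a self-contained sketch (the `Candidate` shape is illustrative):

```typescript
interface Candidate {
  id: string;
  relevance: number; // composite relevance score
  tokens: number;    // token count of the memory
}

// Greedy approximation to the token-budget knapsack: take memories in
// descending relevance-per-token order until the budget is exhausted.
function selectMemories(candidates: Candidate[], budget: number): Candidate[] {
  const ranked = [...candidates].sort(
    (a, b) => b.relevance / b.tokens - a.relevance / a.tokens,
  );
  const selected: Candidate[] = [];
  let used = 0;
  for (const c of ranked) {
    if (used + c.tokens <= budget) {
      selected.push(c);
      used += c.tokens;
    }
  }
  return selected;
}
```

Note that the greedy pass skips an item that overflows the budget but keeps scanning, so a small, efficient memory can still be packed in after a large one is rejected.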
Figure 3: RAG Pipeline for Agent Memory Retrieval
+-------------------+ +-------------------+ +-------------------+
| Agent Query | | Embed Query | | Vector Search |
| "Deploy service | --> | q = embed(query) | --> | ANN(q, k=20) |
| X to staging" | | d=3072 | | + metadata filter |
+-------------------+ +-------------------+ +-------------------+
|
v
+-------------------+ +-------------------+ +-------------------+
| Inject into | | Token Budget | | Re-rank |
| Context Window | <-- | Optimization | <-- | (cross-encoder |
| (system prompt | | (knapsack, | | or LLM-based) |
| + memories) | | budget=8K) | | |
+-------------------+ +-------------------+ +-------------------+
|
v
+-------------------+
| Agent Reasoning |
| with augmented |
| context |
+-------------------+
The re-ranking step is critical for production quality. Initial vector search provides high recall but imperfect precision. A cross-encoder model (e.g., ms-marco-MiniLM-L-12-v2) or LLM-based reranker scores each candidate memory against the query with full attention, producing a more accurate relevance ordering. This two-stage approach (fast retrieval then precise reranking) achieves both speed and quality.
4. Event Sourcing for Agent State
4.1 The Event Sourcing Pattern
Event sourcing is a persistence pattern in which state changes are stored as an immutable, append-only sequence of events rather than as mutable records. This pattern is natural for agent memory because it preserves the complete history of agent behavior, enables temporal queries ("what did the agent know at time T?"), and supports replay for debugging and analysis.
The core principle: the current state of any entity is derived by replaying its event history from the beginning (or from the most recent snapshot).
```typescript
interface AgentEvent {
  event_id: string;              // UUID v7 (time-ordered)
  agent_id: string;              // Agent that generated the event
  stream_id: string;             // Aggregate/entity identifier
  event_type: string;            // e.g., 'TaskStarted', 'MemoryStored', 'SkillLearned'
  version: number;               // Monotonically increasing per stream
  timestamp: number;             // Unix epoch ms
  payload: Record<string, any>;  // Event-specific data
  metadata: {
    correlation_id: string;      // Links related events
    causation_id: string;        // Event that caused this event
    user_id?: string;            // Human initiator, if any
  };
}
```
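The "replay from the beginning" principle is a left fold over the event stream. The sketch below illustrates this with a simplified event union and task state (both are illustrative shapes, not the platform's schema):

```typescript
// Simplified, illustrative event types for the replay example.
type ReplayEvent =
  | { type: 'TaskStarted'; version: number; payload: { task_id: string } }
  | { type: 'StepCompleted'; version: number; payload: { step: string } }
  | { type: 'TaskFinished'; version: number; payload: { success: boolean } };

interface TaskState {
  taskId?: string;
  steps: string[];
  done: boolean;
}

// Pure transition function: apply one event to the current state.
function apply(state: TaskState, e: ReplayEvent): TaskState {
  switch (e.type) {
    case 'TaskStarted':
      return { ...state, taskId: e.payload.task_id };
    case 'StepCompleted':
      return { ...state, steps: [...state.steps, e.payload.step] };
    case 'TaskFinished':
      return { ...state, done: true };
  }
}

// Current state = fold(apply, initialState, eventHistory).
function replay(events: ReplayEvent[]): TaskState {
  return events.reduce(apply, { steps: [], done: false });
}
```

Snapshotting (Section 4.3) simply replaces the initial state in the fold with the most recent snapshot and replays only the events after it.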
4.2 CQRS: Separating Reads and Writes
Command Query Responsibility Segregation (CQRS) separates the write model (event store) from the read model (query-optimized projections). This separation is essential for agent memory because the write path must be fast and reliable (never lose an event), while the read path requires complex queries across multiple dimensions (time, agent, task, content).
Figure 4: CQRS Architecture for Agent Memory
COMMAND SIDE (Write)
====================
[Agent] --> [Command Handler] --> [Event Store (PostgreSQL)]
| |
| validate | append event
| & process | (immutable)
v v
[Domain Logic] [Event Published to Bus]
|
+-----------+-----------+-----------+
| | | |
v v v v
[Episodic [Semantic [Procedural [Analytics
Projection] Projection] Projection] Projection]
(Qdrant) (Neo4j/ (PostgreSQL) (ClickHouse)
Qdrant)
QUERY SIDE (Read)
=================
[Agent] --> [Query Handler] --> [Read Model (projection)]
|
[Optimized for specific
query patterns]
4.3 Snapshot Strategy
Replaying the entire event history to reconstruct current state is computationally expensive: O(n) where n is the total number of events. Snapshots reduce this cost by periodically capturing the current state, so that reconstruction only requires replaying events since the last snapshot.
With snapshots every k events:
Reconstruction cost = O(n mod k) (events since last snapshot)
Storage overhead = O(n/k) (number of snapshots)
Optimal k minimizes: reconstruction_cost + snapshot_storage_cost
Typically k = 100 to 1000 for agent workloads
The snapshot decision can be automated:
```typescript
interface SnapshotPolicy {
  event_count_threshold: number; // Snapshot every N events (e.g., 500)
  time_threshold_ms: number;     // Snapshot every T milliseconds (e.g., 3600000)
  size_threshold_bytes: number;  // Snapshot when state exceeds S bytes
  strategy: 'count' | 'time' | 'size' | 'adaptive';
}
```
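A possible evaluation of this policy for the 'count' and 'time' strategies (a sketch under the stated thresholds; 'size' and 'adaptive' would follow the same pattern):

```typescript
interface SimpleSnapshotPolicy {
  event_count_threshold: number; // e.g., 500 events
  time_threshold_ms: number;     // e.g., 3_600_000 ms (1 hour)
  strategy: 'count' | 'time';
}

// Decide whether to take a snapshot given activity since the last one.
function shouldSnapshot(
  policy: SimpleSnapshotPolicy,
  eventsSinceSnapshot: number,
  msSinceSnapshot: number,
): boolean {
  switch (policy.strategy) {
    case 'count':
      return eventsSinceSnapshot >= policy.event_count_threshold;
    case 'time':
      return msSinceSnapshot >= policy.time_threshold_ms;
  }
}
```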
4.4 Storage Growth Model
Event sourcing has a linear storage growth characteristic:
S(t) = S_0 + sum(event_size(i)) for i = 1..n(t)
where:
S_0 = initial storage overhead (schema, indexes)
n(t) = number of events at time t
event_size(i) = bytes for event i (typically 500-5000 bytes)
For an agent generating 1000 events/day at avg 2KB each:
Daily growth = 2 MB
Monthly growth = 60 MB
Annual growth = 730 MB
With vector embeddings (3072d, float32 = 12KB each):
Daily growth = 14 MB (events + embeddings)
Monthly growth = 420 MB
Annual growth = 5 GB
This growth rate is entirely manageable for modern infrastructure, but archival and tiered storage strategies become important at multi-agent scale (100+ agents, each generating 1000+ events/day).
5. Kubernetes-Native Memory Infrastructure
5.1 Architecture Overview
Production agent memory systems require a multi-tier storage architecture deployed on Kubernetes for scalability, resilience, and operational manageability. The architecture comprises three storage tiers:
- Hot tier (Redis): Working memory, sub-millisecond access, volatile
- Warm tier (Qdrant): Vector search, millisecond access, persistent
- Cold tier (PostgreSQL): Event store, relational queries, durable
Figure 5: Kubernetes Memory Infrastructure Architecture
+------------------------------------------------------------------------+
| Kubernetes Cluster |
| |
| +---------------------------+ +---------------------------+ |
| | Agent Pod | | Agent Pod | |
| | +---------------------+ | | +---------------------+ | |
| | | Agent Container | | | | Agent Container | | |
| | | (Node.js/Python) | | | | (Node.js/Python) | | |
| | +---------------------+ | | +---------------------+ | |
| | +---------------------+ | | +---------------------+ | |
| | | Redis Sidecar | | | | Redis Sidecar | | |
| | | (Working Memory) | | | | (Working Memory) | | |
| | | 256MB limit | | | | 256MB limit | | |
| | +---------------------+ | | +---------------------+ | |
| +---------------------------+ +---------------------------+ |
| | | |
| v v |
| +-----------------------------------------------------------+ |
| | Internal Service Mesh (ClusterIP) | |
| +-----------------------------------------------------------+ |
| | | | |
| v v v |
| +-----------------+ +------------------+ +------------------+ |
| | Qdrant | | PostgreSQL | | Redis Cluster | |
| | StatefulSet | | StatefulSet | | (Shared State) | |
| | (3 replicas) | | (Primary + | | (3 replicas) | |
| | | | 2 replicas) | | | |
| | PVC: 50Gi each | | PVC: 100Gi | | PVC: 10Gi each | |
| | RAM: 4Gi each | | RAM: 2Gi | | RAM: 1Gi each | |
| +-----------------+ +------------------+ +------------------+ |
| |
+------------------------------------------------------------------------+
5.2 Qdrant StatefulSet Configuration
Qdrant requires persistent storage and stable network identities, making StatefulSet the appropriate Kubernetes workload type.
```yaml
# qdrant-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: qdrant
  namespace: agent-memory
  labels:
    app: qdrant
    tier: warm-storage
spec:
  serviceName: qdrant-headless
  replicas: 3
  selector:
    matchLabels:
      app: qdrant
  template:
    metadata:
      labels:
        app: qdrant
    spec:
      containers:
        - name: qdrant
          image: qdrant/qdrant:v1.12.4
          ports:
            - containerPort: 6333
              name: http
            - containerPort: 6334
              name: grpc
          resources:
            requests:
              memory: "2Gi"
              cpu: "1000m"
            limits:
              memory: "4Gi"
              cpu: "2000m"
          env:
            - name: QDRANT__CLUSTER__ENABLED
              value: "true"
            - name: QDRANT__CLUSTER__P2P__PORT
              value: "6335"
            - name: QDRANT__STORAGE__OPTIMIZERS__MEMMAP_THRESHOLD_KB
              value: "20480"
            - name: QDRANT__STORAGE__HNSW_INDEX__M
              value: "32"
            - name: QDRANT__STORAGE__HNSW_INDEX__EF_CONSTRUCT
              value: "256"
          volumeMounts:
            - name: qdrant-storage
              mountPath: /qdrant/storage
          readinessProbe:
            httpGet:
              path: /readyz
              port: 6333
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /healthz
              port: 6333
            initialDelaySeconds: 15
            periodSeconds: 20
  volumeClaimTemplates:
    - metadata:
        name: qdrant-storage
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 50Gi
---
apiVersion: v1
kind: Service
metadata:
  name: qdrant-headless
  namespace: agent-memory
spec:
  clusterIP: None
  selector:
    app: qdrant
  ports:
    - port: 6333
      name: http
    - port: 6334
      name: grpc
    - port: 6335
      name: p2p
```
5.3 Redis Sidecar for Working Memory
Each agent pod includes a Redis sidecar for local working memory. This provides sub-millisecond access to active task context without network round-trips to a shared store.
```yaml
# agent-pod-with-redis-sidecar.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-worker
  namespace: agent-memory
spec:
  replicas: 5
  selector:
    matchLabels:
      app: agent-worker
  template:
    metadata:
      labels:
        app: agent-worker
    spec:
      containers:
        - name: agent
          image: blueflyio/agent-worker:latest
          ports:
            - containerPort: 8080
          env:
            - name: REDIS_URL
              value: "redis://localhost:6379"
            - name: QDRANT_URL
              value: "http://qdrant-headless.agent-memory.svc:6333"
            - name: POSTGRES_URL
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: connection-string
            - name: WORKING_MEMORY_TTL_SECONDS
              value: "3600"
            - name: WORKING_MEMORY_MAX_ITEMS
              value: "100"
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
        - name: redis-sidecar
          image: redis:7-alpine
          ports:
            - containerPort: 6379
          args:
            - redis-server
            - --maxmemory
            - "256mb"
            - --maxmemory-policy
            - allkeys-lru
            - --save
            - ""
            - --appendonly
            - "no"
          resources:
            requests:
              memory: "128Mi"
              cpu: "100m"
            limits:
              memory: "256Mi"
              cpu: "200m"
```
5.4 PostgreSQL Event Store
The event store requires strong durability guarantees and support for temporal queries.
```yaml
# postgresql-event-store.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres-eventstore
  namespace: agent-memory
spec:
  serviceName: postgres-headless
  replicas: 3
  selector:
    matchLabels:
      app: postgres-eventstore
  template:
    metadata:
      labels:
        app: postgres-eventstore
    spec:
      containers:
        - name: postgres
          image: postgres:16-alpine
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRES_DB
              value: agent_events
            - name: POSTGRES_USER
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: username
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: password
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
          resources:
            requests:
              memory: "1Gi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
          volumeMounts:
            - name: pg-storage
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: pg-storage
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 100Gi
```
5.5 Resource Calculations
Accurate resource planning requires understanding the relationship between data volume and infrastructure requirements.
Vector Storage (Qdrant):
```text
Memory per vector = dimensions * 4 bytes (float32) + overhead

For text-embedding-3-large (3072d):
  Per vector            = 3072 * 4 = 12,288 bytes = 12 KB
  + HNSW index overhead ~= 2 KB per vector (M=32)
  + Payload overhead    ~= 1 KB per vector (metadata)
  Total per vector      ~= 15 KB

For 1 million vectors:
  Raw vectors     = 1M * 12 KB = 12 GB
  With index      = 1M * 15 KB = 15 GB
  Recommended RAM = 1.5x index = 22.5 GB

For 1M vectors @ 1536d (text-embedding-3-small):
  Per vector       = 1536 * 4 = 6,144 bytes = 6 KB
  Total per vector ~= 9 KB
  1M vectors = 9 GB storage, ~13.5 GB recommended RAM
```
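These rules of thumb can be packaged as a small sizing helper. The sketch below follows this section's own conventions (decimal millions of vectors, the ~2 KB HNSW and ~1 KB payload overheads, the 1.5x RAM headroom rule); the function name is illustrative, not part of the platform.

```typescript
// Sketch: Qdrant sizing estimate matching the worked example above.
// Overhead constants are this section's approximations, not exact internals.
interface SizingEstimate {
  perVectorKB: number;
  storageGB: number;
  recommendedRamGB: number;
}

function estimateQdrantSizing(dimensions: number, numVectors: number): SizingEstimate {
  const rawKB = (dimensions * 4) / 1024;    // float32 vector payload
  const hnswOverheadKB = 2;                 // ~2 KB/vector at M=32 (assumption)
  const payloadOverheadKB = 1;              // ~1 KB/vector metadata (assumption)
  const perVectorKB = rawKB + hnswOverheadKB + payloadOverheadKB;
  // The text mixes binary KB with decimal millions (1M * 15 KB = 15 GB);
  // we keep that convention here for consistency with the table below.
  const storageGB = (perVectorKB * numVectors) / 1e6;
  return {
    perVectorKB,
    storageGB,
    recommendedRamGB: storageGB * 1.5,      // 1.5x headroom rule of thumb
  };
}
```

For example, `estimateQdrantSizing(3072, 1_000_000)` reproduces the 15 KB/vector, 15 GB storage, and 22.5 GB RAM figures above.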
Approximate resource requirements by scale:
| Scale | Vectors | Qdrant RAM | Qdrant Disk | PostgreSQL Disk | Redis RAM |
|---|---|---|---|---|---|
| Small (1 agent) | 100K | 1.5 GB | 5 GB | 10 GB | 256 MB |
| Medium (10 agents) | 1M | 15 GB | 50 GB | 100 GB | 1 GB |
| Large (100 agents) | 10M | 150 GB | 500 GB | 1 TB | 5 GB |
| Enterprise (1000 agents) | 100M | Sharded | Sharded | Sharded | Clustered |
5.6 Horizontal Pod Autoscaler
```yaml
# qdrant-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: qdrant-hpa
  namespace: agent-memory
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: qdrant
  minReplicas: 3
  maxReplicas: 9
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 75
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: qdrant_search_latency_p99
        target:
          type: AverageValue
          averageValue: "20m"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 300
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
        - type: Pods
          value: 1
          periodSeconds: 600
```
6. Memory Consolidation and Learning
6.1 The Consolidation Process
Memory consolidation is the process by which raw episodic memories are transformed into structured semantic knowledge and procedural skills. In biological systems, consolidation occurs primarily during sleep, with the hippocampus replaying episodic traces and the neocortex gradually incorporating them into long-term semantic representations. For AI agents, consolidation is an explicit computational process that can be triggered periodically or on-demand.
The consolidation pipeline has three stages:
- Clustering: Group related episodic memories by task type, domain, and outcome.
- Abstraction: Extract general principles, rules, and patterns from clusters.
- Integration: Merge extracted knowledge into the semantic graph and skill library.
Figure 6: Memory Consolidation Pipeline
+------------------+
| Episodic Store |
| (raw events) |
+------------------+
|
| periodic trigger (every N events or T hours)
v
+------------------+ +------------------+
| Cluster Analysis | | Temporal |
| (embed + DBSCAN | --> | Sequence Mining |
| or k-means) | | (frequent action |
| | | patterns) |
+------------------+ +------------------+
| |
v v
+------------------+ +------------------+
| LLM Abstraction | | Skill Extraction |
| "What general | | "What action |
| knowledge can | | sequence |
| be extracted | | succeeds |
| from these | | repeatedly?" |
| episodes?" | | |
+------------------+ +------------------+
| |
v v
+------------------+ +------------------+
| Semantic Memory | | Procedural Memory |
| (knowledge graph | | (skill library) |
| update) | | |
+------------------+ +------------------+
6.2 Episodic to Semantic Conversion
The conversion process uses an LLM to examine clusters of related episodic memories and extract generalizable knowledge. The prompt template:
```text
Given the following episodic memories from agent interactions:

{clustered_episodes}

Extract general knowledge that can be derived from these experiences.
For each piece of knowledge, provide:
1. A concise statement of the knowledge
2. The confidence level (0.0-1.0) based on how consistently this pattern appears
3. The specific episodes that support this conclusion
4. Any exceptions or conditions that limit this knowledge

Format as structured JSON matching the SemanticNode schema.
```
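A minimal sketch of the plumbing around this template: filling it from a cluster of episodes and validating the model's JSON reply against the expected shape. The function and field names are illustrative, and the actual LLM call is omitted.

```typescript
// Illustrative shape for one extracted knowledge item (assumption, not
// the platform's actual SemanticNode schema).
interface ExtractedKnowledge {
  statement: string;
  confidence: number;            // 0.0-1.0, per the template
  supporting_episodes: string[];
  exceptions?: string;
}

// Fill the consolidation prompt from a cluster of episode summaries.
function buildConsolidationPrompt(episodes: string[]): string {
  return [
    "Given the following episodic memories from agent interactions:",
    episodes.map((e, i) => `${i + 1}. ${e}`).join("\n"),
    "Extract general knowledge that can be derived from these experiences.",
    "Format as structured JSON matching the SemanticNode schema.",
  ].join("\n\n");
}

// Parse the LLM reply, dropping items that violate the schema constraints.
function parseKnowledge(raw: string): ExtractedKnowledge[] {
  const items = JSON.parse(raw) as ExtractedKnowledge[];
  return items.filter(
    (k) =>
      typeof k.statement === "string" &&
      k.confidence >= 0 && k.confidence <= 1 &&
      Array.isArray(k.supporting_episodes),
  );
}
```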
The key quality metric is whether the extracted knowledge actually improves future agent performance. We measure this with A/B testing: agents with consolidated semantic memory versus agents with only episodic recall, on tasks from the same domain. Empirical results show a 12-18% improvement in task completion time when semantic knowledge is available, primarily because the agent can skip the retrieval and reasoning steps that would otherwise be needed to rediscover the same patterns from raw episodes.
6.3 Procedural Extraction from Success Patterns
Procedural memory extraction identifies action sequences that consistently lead to success. The algorithm:
```text
1. Filter the episodic store for events with positive outcomes
   (emotional_valence > threshold).
2. Extract action sequences from successful episodes:
   sequence = [(action_1, context_1), (action_2, context_2), ...]
3. Apply frequent sequential pattern mining (PrefixSpan):
   patterns = PrefixSpan(sequences, min_support=3)
4. For each frequent pattern:
   a. Compute success_rate = successful_applications / total_applications
   b. If success_rate > 0.7:
      - Create a ProceduralMemory entry
      - Generalize context conditions (LLM-assisted)
      - Add to the skill library
5. Validate against held-out episodes.
```
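The mining and filtering steps can be sketched as follows. For brevity this uses a simplified contiguous-subsequence miner in place of full PrefixSpan (which also discovers gapped subsequences) and a substring check for pattern application; all names are illustrative.

```typescript
type Action = string;

// Count how many sequences contain each contiguous subsequence of length >= 2,
// keeping those that meet the minimum support. A simplified stand-in for PrefixSpan.
function minePatterns(sequences: Action[][], minSupport: number): Map<string, number> {
  const counts = new Map<string, number>();
  for (const seq of sequences) {
    const seen = new Set<string>();  // support counts sequences, not occurrences
    for (let start = 0; start < seq.length; start++) {
      for (let end = start + 2; end <= seq.length; end++) {
        seen.add(seq.slice(start, end).join(" -> "));
      }
    }
    for (const p of seen) counts.set(p, (counts.get(p) ?? 0) + 1);
  }
  return new Map([...counts].filter(([, c]) => c >= minSupport));
}

// Steps 1-4: mine frequent patterns from successful episodes, then keep those
// whose success rate across all episodes clears the 0.7 threshold.
function extractSkills(
  successSeqs: Action[][],
  allSeqs: Action[][],
  minSupport = 3,
  minSuccessRate = 0.7,
): { pattern: string; successRate: number }[] {
  // Substring containment is a crude application check, fine for a sketch.
  const contains = (seq: Action[], pattern: string) => seq.join(" -> ").includes(pattern);
  const skills: { pattern: string; successRate: number }[] = [];
  for (const [pattern, successes] of minePatterns(successSeqs, minSupport)) {
    const total = allSeqs.filter((s) => contains(s, pattern)).length;
    const successRate = total > 0 ? successes / total : 0;
    if (successRate > minSuccessRate) skills.push({ pattern, successRate });
  }
  return skills;
}
```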
6.4 Forgetting Curves and Memory Decay
Not all memories should be retained indefinitely with equal weight. Ebbinghaus (1885) established that memory strength decays exponentially without rehearsal:
S(t) = S_0 * e^(-t / tau)
where:
S(t) = memory strength at time t
S_0 = initial encoding strength
t = time since encoding
tau = time constant (depends on importance, rehearsal)
For agent memory, the decay function is modulated by importance and access frequency:
decay(m, t) = m.importance * e^(-t / (tau_base * (1 + log(1 + m.access_count))))
where:
tau_base = base time constant (e.g., 30 days)
m.importance = computed importance score
m.access_count = number of times memory has been retrieved
Memories that are frequently accessed decay more slowly (the logarithmic rehearsal factor). High-importance memories also decay more slowly. This produces a natural forgetting curve where trivial, unretrieved memories fade while critical, frequently-used memories persist.
The practical implementation applies decay as a weighting factor during retrieval rather than deleting memories:
effective_relevance(m, q, t) = relevance(m, q) * decay(m, t)
Memories with very low decay values (below a threshold, e.g., 0.01) can be archived to cold storage, reducing the active search space while preserving the ability to recover historical information if needed.
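The decay and effective-relevance formulas translate directly to code. This sketch uses the example values from the text (30-day base time constant, 0.01 archive threshold); the record shape is illustrative.

```typescript
// Illustrative memory record shape for decay computation.
interface MemoryRecord {
  importance: number;   // computed importance score in [0, 1]
  accessCount: number;  // retrieval count (the rehearsal factor)
  encodedAtMs: number;  // encoding timestamp
}

const TAU_BASE_MS = 30 * 24 * 3600 * 1000;  // 30-day base time constant
const ARCHIVE_THRESHOLD = 0.01;             // below this, move to cold storage

// decay(m, t) = m.importance * e^(-t / (tau_base * (1 + log(1 + access_count))))
function decay(m: MemoryRecord, nowMs: number): number {
  const t = nowMs - m.encodedAtMs;
  const tau = TAU_BASE_MS * (1 + Math.log(1 + m.accessCount));
  return m.importance * Math.exp(-t / tau);
}

// Decay is applied as a retrieval-time weight, not a deletion.
function effectiveRelevance(relevance: number, m: MemoryRecord, nowMs: number): number {
  return relevance * decay(m, nowMs);
}

function shouldArchive(m: MemoryRecord, nowMs: number): boolean {
  return decay(m, nowMs) < ARCHIVE_THRESHOLD;
}
```

Note that a frequently retrieved memory has a larger effective time constant and therefore decays more slowly, exactly as the logarithmic rehearsal factor prescribes.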
7. Multi-Agent Shared Memory
7.1 The Coordination Problem
When multiple agents operate in a shared environment, they need mechanisms to share knowledge, coordinate actions, and avoid redundant work. This requires shared memory systems that maintain consistency without sacrificing the autonomy that makes multi-agent systems valuable.
The fundamental tension is between consistency (all agents see the same state) and availability (agents can operate independently when peers are unavailable). In distributed systems terms, this is the CAP theorem applied to agent memory.
7.2 Shared Knowledge Base Architecture
A shared knowledge base provides a common semantic memory that all agents can read from and contribute to. The architecture uses a layered approach:
Layer 1: Agent-Local Memory (private)
- Personal episodic memories
- Agent-specific procedural skills
- Working memory
Layer 2: Team-Shared Memory (scoped)
- Shared semantic knowledge for a task group
- Team-level procedural skills
- Shared task context
Layer 3: Organization-Wide Memory (global)
- Global knowledge graph
- Organizational policies and rules
- Cross-team learned patterns
Each layer has different consistency requirements. Agent-local memory requires no coordination. Team-shared memory uses eventual consistency with conflict resolution. Organization-wide memory uses strong consistency with write authorization controls.
7.3 CRDTs for Consistency
Conflict-free Replicated Data Types (CRDTs) provide eventual consistency without coordination. For agent memory, the key CRDT types are:
- G-Counter (Grow-only Counter): for access counts and event counters. Each agent maintains its own counter; the global value is the sum. Merges by taking the maximum of each agent's count.
- LWW-Register (Last-Writer-Wins Register): for semantic node properties that can be updated independently. Merges by taking the value with the latest timestamp.
- OR-Set (Observed-Remove Set): for sets of relationships, tags, or references. Supports both add and remove operations with deterministic conflict resolution.
```typescript
interface CRDTMemoryNode {
  id: string;
  content: LWWRegister<string>;          // Last-writer-wins for content
  embedding: LWWRegister<Float32Array>;  // Latest embedding
  importance: GCounter;                  // Grows as agents access
  tags: ORSet<string>;                   // Add/remove tags
  contributors: GSet<string>;            // Grow-only set of contributing agents
  version_vector: Map<string, number>;   // Per-agent version tracking
}
```
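As a concrete illustration of the merge semantics, here is a minimal G-Counter. The class is a sketch, not the platform's implementation; what matters is that merge is commutative, associative, and idempotent, which is the property that lets replicas converge without coordination.

```typescript
// Minimal G-Counter: per-agent counts, global value = sum,
// merge = element-wise maximum of the two count maps.
class GCounter {
  private counts = new Map<string, number>();

  // Each agent only ever increments its own slot.
  increment(agentId: string, by = 1): void {
    this.counts.set(agentId, (this.counts.get(agentId) ?? 0) + by);
  }

  // The global value is the sum over all agents' slots.
  value(): number {
    let sum = 0;
    for (const c of this.counts.values()) sum += c;
    return sum;
  }

  // Merging takes the max per agent; re-merging the same state is a no-op.
  merge(other: GCounter): void {
    for (const [agent, c] of other.counts) {
      this.counts.set(agent, Math.max(this.counts.get(agent) ?? 0, c));
    }
  }
}
```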
7.4 Blackboard Architecture Pattern
The blackboard architecture (Hayes-Roth, 1985) provides a structured approach to multi-agent shared memory. A central blackboard holds the shared problem state. Knowledge sources (agents) read from and write to the blackboard. A control component determines which knowledge source should act next.
Figure 7: Blackboard Architecture for Multi-Agent Memory
+------------------------------------------------------------------------+
| BLACKBOARD |
| |
| +------------------+ +------------------+ +------------------+ |
| | Goal Layer | | Plan Layer | | Execution Layer | |
| | (what to | | (how to | | (current | |
| | achieve) | | achieve it) | | progress) | |
| +------------------+ +------------------+ +------------------+ |
| |
| +------------------+ +------------------+ +------------------+ |
| | Knowledge Layer | | Hypothesis | | Evidence Layer | |
| | (shared facts | | Layer (proposed | | (observations, | |
| | and rules) | | explanations) | | measurements) | |
| +------------------+ +------------------+ +------------------+ |
| |
+------------------------------------------------------------------------+
^ ^ ^ ^ ^
| | | | |
+--------+ +--------+ +--------+ +--------+ +--------+
|Agent 1 | |Agent 2 | |Agent 3 | |Agent 4 | |Agent 5 |
|Planner | |Coder | |Tester | |Reviewer| |Deployer|
+--------+ +--------+ +--------+ +--------+ +--------+
Each agent:
1. Reads relevant layers
2. Applies its expertise
3. Writes results back
4. Control decides next agent
7.5 Conflict Resolution
When multiple agents attempt to update the same memory concurrently, conflicts must be resolved deterministically:
Resolution Strategy Priority:
1. Evidence-based: Update with more supporting episodes wins
2. Confidence-based: Higher confidence score wins
3. Recency-based: Most recent update wins (LWW)
4. Authority-based: Higher-tier agent's update wins
5. Merge: If updates are complementary, merge both
The resolution strategy is selected based on the memory type:
| Memory Type | Default Resolution | Rationale |
|---|---|---|
| Semantic facts | Evidence-based | More evidence = more reliable |
| Procedural skills | Confidence + recency | Skills improve over time |
| Shared task state | Recency (LWW) | Current state matters most |
| Knowledge graph edges | Merge (additive) | Relationships accumulate |
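The strategy table can be expressed as a dispatch over memory types. This is a sketch with an illustrative update shape; the evidence, confidence-plus-recency, and LWW rules follow the defaults above.

```typescript
// Illustrative shape for a concurrent memory update.
interface MemoryUpdate {
  value: string;
  evidenceCount: number;  // supporting episodes
  confidence: number;     // 0.0-1.0
  timestampMs: number;
}

type Resolver = (a: MemoryUpdate, b: MemoryUpdate) => MemoryUpdate;

const resolvers: Record<string, Resolver> = {
  // Semantic facts: the update with more supporting episodes wins.
  semantic_fact: (a, b) => (a.evidenceCount >= b.evidenceCount ? a : b),
  // Procedural skills: higher confidence wins, recency breaks ties.
  procedural_skill: (a, b) =>
    a.confidence !== b.confidence
      ? (a.confidence > b.confidence ? a : b)
      : (a.timestampMs >= b.timestampMs ? a : b),
  // Shared task state: last-writer-wins.
  task_state: (a, b) => (a.timestampMs >= b.timestampMs ? a : b),
};

function resolve(memoryType: string, a: MemoryUpdate, b: MemoryUpdate): MemoryUpdate {
  const resolver = resolvers[memoryType] ?? resolvers.task_state;  // default to LWW
  return resolver(a, b);
}
```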
8. Privacy, Security, and Memory Governance
8.1 The Privacy Challenge
Agent memory systems store rich records of interactions, decisions, and outcomes. This data is inherently sensitive: it may contain personal information from users, proprietary business data, or security-relevant system details. Governance of agent memory requires controls at every layer of the architecture.
8.2 Access Control Model
Memory access is governed by a role-based access control (RBAC) model with four dimensions:
- Agent identity: Which agent is requesting access?
- Memory scope: Private, team, or organization-wide?
- Operation type: Read, write, update, delete?
- Content classification: Public, internal, confidential, restricted?
```typescript
interface MemoryAccessPolicy {
  agent_id: string;
  allowed_scopes: ('private' | 'team' | 'organization')[];
  allowed_operations: ('read' | 'write' | 'update' | 'delete')[];
  content_classifications: ('public' | 'internal' | 'confidential' | 'restricted')[];
  time_restrictions?: {
    retention_days: number;  // Auto-delete after N days
    access_hours?: string;   // Cron-style access window
  };
  audit_level: 'none' | 'access' | 'content';  // Logging granularity
}
```
8.3 PII Detection and Redaction
Before storing episodic memories, a PII detection pipeline identifies and redacts personally identifiable information. The pipeline uses both pattern matching (for structured PII like emails, phone numbers, SSNs) and NER models (for unstructured PII like names, addresses).
The redaction process replaces PII with typed tokens:
```text
Input:  "John Smith called from 555-123-4567 about account #12345"
Output: "{{PERSON_1}} called from {{PHONE_1}} about account {{ACCOUNT_1}}"

Mapping stored separately (encrypted):
  PERSON_1  -> "John Smith"
  PHONE_1   -> "555-123-4567"
  ACCOUNT_1 -> "12345"
```
The mapping is stored in a separate, encrypted data store with stricter access controls than the memory store itself. This separation ensures that even if the memory store is compromised, PII is not directly exposed.
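The pattern-matching stage of the pipeline might look like the following sketch; the NER stage for unstructured PII is omitted, and the regexes are illustrative rather than production-grade.

```typescript
// Pattern-based redaction for structured PII. Returns the redacted text plus
// the token-to-value mapping, which is persisted separately and encrypted.
interface RedactionResult {
  redacted: string;
  mapping: Record<string, string>;  // stored in the encrypted mapping store
}

// Illustrative patterns; real deployments need broader, locale-aware rules.
const PII_PATTERNS: { type: string; regex: RegExp }[] = [
  { type: "EMAIL", regex: /[\w.+-]+@[\w-]+\.[\w.]+/g },
  { type: "PHONE", regex: /\b\d{3}-\d{3}-\d{4}\b/g },
  { type: "SSN", regex: /\b\d{3}-\d{2}-\d{4}\b/g },
];

function redact(text: string): RedactionResult {
  const mapping: Record<string, string> = {};
  let redacted = text;
  for (const { type, regex } of PII_PATTERNS) {
    let i = 0;
    redacted = redacted.replace(regex, (match) => {
      const token = `{{${type}_${++i}}}`;  // typed token, e.g. {{PHONE_1}}
      mapping[token] = match;
      return token;
    });
  }
  return { redacted, mapping };
}
```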
8.4 GDPR Right to Erasure
The General Data Protection Regulation (GDPR) establishes the right to erasure (Article 17): individuals can request the deletion of their personal data. For agent memory systems, this requires the ability to:
- Identify all memories associated with a specific individual
- Delete those memories from all stores (episodic, semantic, procedural)
- Propagate deletion to derived knowledge (if derived solely from that individual's data)
- Verify deletion completeness
Event sourcing complicates erasure because the event store is append-only. The solution is crypto-shredding: each individual's PII is encrypted with a unique key. Erasure is accomplished by destroying the encryption key, rendering the PII unrecoverable even though the encrypted data remains in the event store.
Storage: [Event] -> [Encrypted PII] -> stored with key_id reference
Erasure: DELETE FROM encryption_keys WHERE individual_id = ?
Result: PII becomes irrecoverable; event structure preserved for audit trail
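A minimal crypto-shredding sketch using Node's built-in crypto module: per-individual AES-256-GCM keys live in a key store, and erasure deletes only the key, leaving the ciphertext in the append-only log. Class and method names are illustrative; a production key store would itself be encrypted and access-controlled.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

class CryptoShredder {
  private keys = new Map<string, Buffer>();  // individual_id -> AES-256 key

  // Encrypt a PII value under the individual's key, creating the key on first use.
  encryptPII(individualId: string, plaintext: string): string {
    if (!this.keys.has(individualId)) this.keys.set(individualId, randomBytes(32));
    const iv = randomBytes(12);
    const cipher = createCipheriv("aes-256-gcm", this.keys.get(individualId)!, iv);
    const ct = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
    const tag = cipher.getAuthTag();
    return [iv, tag, ct].map((b) => b.toString("base64")).join(".");
  }

  // Returns null once the key has been shredded: the blob is unrecoverable.
  decryptPII(individualId: string, blob: string): string | null {
    const key = this.keys.get(individualId);
    if (!key) return null;
    const [iv, tag, ct] = blob.split(".").map((s) => Buffer.from(s, "base64"));
    const decipher = createDecipheriv("aes-256-gcm", key, iv);
    decipher.setAuthTag(tag);
    return Buffer.concat([decipher.update(ct), decipher.final()]).toString("utf8");
  }

  // GDPR erasure = destroying the key, not rewriting the event store.
  erase(individualId: string): void {
    this.keys.delete(individualId);
  }
}
```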
8.5 Memory Audit Trail
All memory operations are logged to an immutable audit trail:
```typescript
interface MemoryAuditEntry {
  timestamp: number;
  agent_id: string;
  operation: 'read' | 'write' | 'update' | 'delete' | 'search';
  memory_type: 'episodic' | 'semantic' | 'procedural' | 'working';
  memory_ids: string[];
  query?: string;                // For search operations
  result_count?: number;
  access_justification: string;  // Why the agent needed this memory
  policy_evaluation: {
    allowed: boolean;
    policy_id: string;
    denied_reason?: string;
  };
}
```
9. Benchmarks and Performance Analysis
9.1 Latency Benchmarks
Latency is the critical performance metric for agent memory because it directly impacts the agent's response time and throughput. We benchmark each storage tier under realistic workloads.
Table 4: Latency Benchmarks by Storage Tier
| Operation | Redis (Working) | Qdrant (Vector) | PostgreSQL (Event) |
|---|---|---|---|
| Single key read | 0.2ms | N/A | 2ms |
| Single key write | 0.3ms | N/A | 3ms |
| Vector search (k=10, 100K vectors) | N/A | 5ms | N/A |
| Vector search (k=10, 1M vectors) | N/A | 12ms | N/A |
| Vector search (k=10, 10M vectors) | N/A | 28ms | N/A |
| Vector search + filter (1M vectors) | N/A | 15ms | N/A |
| Event append | N/A | N/A | 4ms |
| Event query (time range, 1M events) | N/A | N/A | 25ms |
| Event query (aggregate, 1M events) | N/A | N/A | 45ms |
| Snapshot read | N/A | N/A | 8ms |
| Full memory retrieval pipeline (end-to-end) | N/A | N/A | N/A |
| -- Cache hit | 0.5ms | - | - |
| -- Cache miss, vector search | - | 15ms | - |
| -- Cache miss, vector + event enrichment | - | 15ms | 25ms |
| -- Total (typical) | - | - | 35-50ms |
All latencies measured at p50 on:
- 3-node Qdrant cluster (4 vCPU, 16GB RAM each)
- 3-node PostgreSQL (2 vCPU, 8GB RAM, primary + 2 replicas)
- Redis 7 (2 vCPU, 4GB RAM, single instance per agent)
- Network: Kubernetes pod-to-pod, same availability zone
9.2 Throughput Benchmarks
| Operation | Throughput (ops/sec) | Configuration |
|---|---|---|
| Redis reads | 150,000 | Single instance, pipelining |
| Redis writes | 120,000 | Single instance, pipelining |
| Qdrant vector search | 800 | 3 replicas, 1M vectors, k=10 |
| Qdrant vector upsert | 5,000 | Batch size 100 |
| PostgreSQL event insert | 15,000 | Batch size 100, async commit |
| PostgreSQL event query | 2,000 | Time-range queries |
| Embedding generation (OpenAI) | 3,000 | text-embedding-3-small, batch |
| Embedding generation (self-hosted BGE-M3) | 500 | RTX 4090, batch size 32 |
9.3 Cost Analysis
Table 5: Cost Per Million Memories by Deployment Model
| Component | Self-Hosted (K8s) | Managed Cloud | Hybrid |
|---|---|---|---|
| Embedding generation | $0.02 (self-hosted) | $0.13 (OpenAI large) | $0.02 |
| Vector storage (Qdrant) | $0.15/month (3-node) | $0.45/month (Pinecone) | $0.15 |
| Event storage (PostgreSQL) | $0.08/month | $0.25/month (RDS) | $0.08 |
| Working memory (Redis) | $0.03/month | $0.10/month (ElastiCache) | $0.03 |
| Network/transfer | $0.01/month | $0.05/month | $0.02 |
| Total per 1M memories/month | $0.29 | $0.98 | $0.30 |
Cost per memory operation:
Write (embed + store): $0.000013 (self-hosted) to $0.000130 (cloud)
Read (search + retrieve): $0.000002 (self-hosted) to $0.000008 (cloud)
Consolidation (per episode): $0.001 to $0.003 (LLM cost for abstraction)
9.4 Scalability Characteristics
The system scales along three dimensions:
- Vertical: Increasing RAM and CPU per node improves throughput but has diminishing returns beyond 32GB RAM per Qdrant node.
- Horizontal: Adding Qdrant replicas increases search throughput linearly. Adding PostgreSQL read replicas increases query throughput. Redis can be clustered for shared state.
- Sharding: Beyond 10M vectors per collection, Qdrant supports distributed sharding across nodes. This introduces shard management complexity but enables scaling to billions of vectors.
Scaling equations:
```text
Search throughput = base_throughput * num_replicas * efficiency_factor
  where efficiency_factor ~= 0.85 (coordination overhead)

Storage capacity = num_shards * per_shard_capacity
  where per_shard_capacity ~= 10M vectors (recommended max)

Write throughput = base_write_throughput / replication_factor
  (writes must propagate to all replicas)
```
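The scaling equations above can be packaged as a quick capacity planner using the rule-of-thumb constants from the text; the function names are illustrative.

```typescript
// Rule-of-thumb constants from the scaling equations above.
const EFFICIENCY_FACTOR = 0.85;             // replica coordination overhead
const PER_SHARD_CAPACITY = 10_000_000;      // recommended max vectors per shard

// Effective search throughput across replicas.
const searchThroughput = (baseOpsPerSec: number, numReplicas: number): number =>
  baseOpsPerSec * numReplicas * EFFICIENCY_FACTOR;

// Shards needed to hold a given vector count.
const shardsNeeded = (numVectors: number): number =>
  Math.ceil(numVectors / PER_SHARD_CAPACITY);

// Writes divide by the replication factor, since each write hits all replicas.
const writeThroughput = (baseWriteOpsPerSec: number, replicationFactor: number): number =>
  baseWriteOpsPerSec / replicationFactor;
```

For example, with the 800 ops/sec single-replica figure from Table 4's companion benchmarks, three replicas yield roughly 2,040 searches/sec after coordination overhead.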
10. Future Directions
10.1 Neuromorphic Memory Architectures
Current vector-based memory systems are a crude approximation of biological memory. Emerging neuromorphic computing architectures (Intel Loihi 2, IBM NorthPole) offer hardware-level support for associative memory, content-addressable storage, and spike-timing-dependent plasticity. These architectures could enable agent memory systems that learn and consolidate at hardware speed, eliminating the latency and energy costs of software-based embedding and search.
10.2 Continual Learning Without Catastrophic Forgetting
A persistent challenge in agent learning is catastrophic forgetting: when learning new information overwrites previously learned knowledge. Current approaches (experience replay, elastic weight consolidation, progressive neural networks) address this partially. The memory architecture described in this paper provides an external solution -- by storing knowledge outside the model weights, the agent can learn continuously without risking forgetting. The integration of external memory with in-context learning represents a promising frontier.
10.3 Memory-Augmented Reasoning
Chain-of-thought reasoning and tree-of-thought search can be enhanced by memory-augmented retrieval at each reasoning step. Rather than reasoning purely from the current context, the agent retrieves relevant memories at each step to inform the next. This transforms reasoning from a context-limited process into a knowledge-grounded process.
10.4 Cross-Modal Memory
Current agent memory systems are primarily text-based. Extending memory to include visual observations (screenshots, diagrams), audio (conversations, alerts), and structured data (metrics, logs) requires multi-modal embedding models and cross-modal retrieval. Models like CLIP and ImageBind demonstrate that unified embedding spaces across modalities are achievable.
11. References
- Atkinson, R. C., & Shiffrin, R. M. (1968). Human memory: A proposed system and its control processes. In K. W. Spence & J. T. Spence (Eds.), The Psychology of Learning and Motivation (Vol. 2, pp. 89-195). Academic Press. DOI:10.1016/S0079-7421(08)60422-3
- Baddeley, A. D., & Hitch, G. (1974). Working memory. In G. H. Bower (Ed.), The Psychology of Learning and Motivation (Vol. 8, pp. 47-89). Academic Press. DOI:10.1016/S0079-7421(08)60452-1
- Baddeley, A. D. (2000). The episodic buffer: A new component of working memory? Trends in Cognitive Sciences, 4(11), 417-423. DOI:10.1016/S1364-6613(00)01538-2
- Borgeaud, S., Mensch, A., Hoffmann, J., Cai, T., Rutherford, E., Millican, K., ... & Sifre, L. (2022). Improving language models by retrieving from trillions of tokens. Proceedings of the 39th International Conference on Machine Learning (ICML). arXiv:2112.04426
- Ebbinghaus, H. (1885). Über das Gedächtnis: Untersuchungen zur experimentellen Psychologie. Duncker & Humblot. (English translation, 1913.)
- Graves, A., Wayne, G., & Danihelka, I. (2014). Neural Turing machines. arXiv:1410.5401
- Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwińska, A., ... & Hassabis, D. (2016). Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626), 471-476. DOI:10.1038/nature20101
- Hayes-Roth, B. (1985). A blackboard architecture for control. Artificial Intelligence, 26(3), 251-321. DOI:10.1016/0004-3702(85)90063-3
- Johnson, J., Douze, M., & Jégou, H. (2019). Billion-scale similarity search with GPUs (FAISS). IEEE Transactions on Big Data, 7(3), 535-547. arXiv:1702.08734
- Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., ... & Yih, W. T. (2020). Dense passage retrieval for open-domain question answering. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). arXiv:2004.04906
- Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., ... & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems (NeurIPS), 33, 9459-9474. arXiv:2005.11401
- Malkov, Y. A., & Yashunin, D. A. (2018). Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(4), 824-836. DOI:10.1109/TPAMI.2018.2889473 | arXiv:1603.09320
- Miller, G. A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63(2), 81-97. DOI:10.1037/h0043158
- Park, J. S., O'Brien, J. C., Cai, C. J., Morris, M. R., Liang, P., & Bernstein, M. S. (2023). Generative agents: Interactive simulacra of human behavior. Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST). DOI:10.1145/3586183.3606763 | arXiv:2304.03442
- Peng, B., Galley, M., He, P., Cheng, H., Xie, Y., Hu, Y., ... & Gao, J. (2023). Check your facts and try again: Improving large language models with external knowledge and automated feedback. arXiv:2302.12813
- Russell, S. J., & Norvig, P. (2021). Artificial Intelligence: A Modern Approach (4th ed.). Pearson. ISBN 978-0134610993
- Shapiro, M., Preguiça, N., Baquero, C., & Zawirski, M. (2011). Conflict-free replicated data types. Proceedings of the 13th International Conference on Stabilization, Safety, and Security of Distributed Systems. DOI:10.1007/978-3-642-24550-3_29
- Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems (NeurIPS), 36. arXiv:2303.11366
- Sukhbaatar, S., Weston, J., & Fergus, R. (2015). End-to-end memory networks. Advances in Neural Information Processing Systems (NeurIPS), 28. arXiv:1503.08895
- Tulving, E. (1972). Episodic and semantic memory. In E. Tulving & W. Donaldson (Eds.), Organization of Memory (pp. 381-403). Academic Press.
- Tulving, E. (1985). Memory and consciousness. Canadian Psychology, 26(1), 1-12. DOI:10.1037/h0080017
- Wang, L., Ma, C., Feng, X., Zhang, Z., Yang, H., Zhang, J., ... & Wen, J. R. (2024). A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6), 186345. arXiv:2308.11432
- Wayne, G., Hung, C. C., Amos, D., Mirza, M., Ahuja, A., Grabska-Barwińska, A., ... & Lillicrap, T. (2018). Unsupervised predictive memory in a goal-directed agent. arXiv:1803.10760
- Weston, J., Chopra, S., & Bordes, A. (2015). Memory networks. Proceedings of the International Conference on Learning Representations (ICLR). arXiv:1410.3916
- Weaviate (2024). Vector database benchmarks. weaviate.io
- Xi, Z., Chen, W., Guo, X., He, W., Ding, Y., Hong, B., ... & Gui, T. (2023). The rise and potential of large language model based agents: A survey. arXiv:2309.07864
- Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing reasoning and acting in language models. Proceedings of the International Conference on Learning Representations (ICLR). arXiv:2210.03629
- Zhong, W., Guo, L., Gao, Q., Ye, H., & Wang, Y. (2024). MemoryBank: Enhancing large language models with long-term memory. Proceedings of the AAAI Conference on Artificial Intelligence, 38(17), 19724-19731. arXiv:2305.10250
Appendix A: Glossary
| Term | Definition |
|---|---|
| ANN | Approximate Nearest Neighbor: sublinear search for similar vectors |
| CQRS | Command Query Responsibility Segregation: separate read/write models |
| CRDT | Conflict-free Replicated Data Type: eventually consistent distributed data structure |
| HNSW | Hierarchical Navigable Small World: graph-based ANN algorithm |
| LWW | Last-Writer-Wins: conflict resolution strategy using timestamps |
| MRR | Mean Reciprocal Rank: retrieval quality metric |
| nDCG | Normalized Discounted Cumulative Gain: graded relevance metric |
| PII | Personally Identifiable Information |
| PVC | Persistent Volume Claim: Kubernetes storage abstraction |
| RAG | Retrieval-Augmented Generation: injecting retrieved context into LLM prompts |
| RBAC | Role-Based Access Control |
Appendix B: Collection Configuration for Qdrant
```json
{
  "collection_name": "agent_episodic_memory",
  "vectors": {
    "size": 3072,
    "distance": "Cosine",
    "on_disk": false,
    "hnsw_config": {
      "m": 32,
      "ef_construct": 256,
      "full_scan_threshold": 10000
    },
    "quantization_config": {
      "scalar": {
        "type": "int8",
        "quantile": 0.99,
        "always_ram": true
      }
    }
  },
  "optimizers_config": {
    "memmap_threshold": 20000,
    "indexing_threshold": 20000,
    "flush_interval_sec": 5
  },
  "replication_factor": 2,
  "write_consistency_factor": 1,
  "shard_number": 3
}
```
Appendix C: Event Store Schema (PostgreSQL)
```sql
CREATE TABLE agent_events (
    event_id     UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    stream_id    UUID NOT NULL,
    agent_id     VARCHAR(64) NOT NULL,
    event_type   VARCHAR(128) NOT NULL,
    version      BIGINT NOT NULL,
    timestamp_ms BIGINT NOT NULL DEFAULT (EXTRACT(EPOCH FROM NOW()) * 1000)::BIGINT,
    payload      JSONB NOT NULL,
    metadata     JSONB NOT NULL DEFAULT '{}',
    embedding_id UUID,  -- Reference to vector in Qdrant
    CONSTRAINT unique_stream_version UNIQUE (stream_id, version)
);

CREATE INDEX idx_events_agent_time ON agent_events (agent_id, timestamp_ms DESC);
CREATE INDEX idx_events_stream ON agent_events (stream_id, version ASC);
CREATE INDEX idx_events_type ON agent_events (event_type);
CREATE INDEX idx_events_payload ON agent_events USING GIN (payload jsonb_path_ops);

CREATE TABLE agent_snapshots (
    snapshot_id  UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    stream_id    UUID NOT NULL,
    agent_id     VARCHAR(64) NOT NULL,
    version      BIGINT NOT NULL,
    timestamp_ms BIGINT NOT NULL,
    state        JSONB NOT NULL,
    CONSTRAINT unique_snapshot_version UNIQUE (stream_id, version)
);

CREATE INDEX idx_snapshots_stream ON agent_snapshots (stream_id, version DESC);
```
End of Whitepaper 03 BlueFly.io Agent Platform Series Copyright 2026 BlueFly.io. All rights reserved.