Skip to main content
PUBLISHED
Whitepaper

Agent Memory Systems and Cognitive Architectures: From Episodic Recall to Procedural Learning in Autonomous AI

A comprehensive analysis of memory architectures for autonomous AI agents — spanning episodic, semantic, procedural, and working memory subsystems. Agents with structured memory achieve 34% improvement in multi-step task completion, with episodic-to-semantic consolidation enabling emergent procedural learning.

BlueFly.io / OSSA Research Team··39 min read

Agent Memory Systems and Cognitive Architectures: From Episodic Recall to Procedural Learning in Autonomous AI

Whitepaper 03 | BlueFly.io Agent Platform Series Date: February 2026 Version: 1.0


Abstract

Autonomous AI agents operating in complex, long-horizon environments face a fundamental constraint: the absence of persistent, structured memory reduces them to reactive systems incapable of genuine agency. This whitepaper presents a comprehensive analysis of memory architectures for autonomous AI agents, drawing from cognitive science foundations established by Tulving (1985), the Atkinson-Shiffrin model (1968), and contemporary research in neural memory augmentation (Wayne et al., 2018). We formalize a taxonomy of agent memory spanning episodic, semantic, procedural, and working memory subsystems, each serving distinct computational roles analogous to their biological counterparts. We detail the engineering infrastructure required to realize these memory systems at production scale, including vector search with approximate nearest neighbor algorithms, event sourcing for state reconstruction, and Kubernetes-native deployment patterns. Mathematical formulations for memory retrieval, consolidation, and decay are provided, alongside empirical benchmarks demonstrating latency, throughput, and cost characteristics across storage tiers. The multi-agent case introduces shared memory architectures using conflict-free replicated data types (CRDTs) and blackboard patterns. We address privacy and governance concerns including GDPR compliance, PII redaction, and memory access control. Our analysis demonstrates that agents equipped with structured memory systems achieve a 34% improvement in multi-step task completion rates, with episodic-to-semantic consolidation enabling emergent procedural learning. This whitepaper serves as both a theoretical foundation and a practical engineering guide for building memory-capable autonomous agents.


1. Why Memory Matters: The Stateless Agent Problem

1.1 The Cognitive Science Foundation

The study of human memory provides the most mature framework for understanding what autonomous agents lack and what they require. Endel Tulving's landmark 1972 and 1985 papers established the distinction between episodic and semantic memory as fundamentally different systems rather than points on a continuum. Episodic memory encodes personally experienced events bound to a specific spatiotemporal context -- the "what, where, and when" of lived experience. Semantic memory stores decontextualized knowledge: facts, concepts, and relationships abstracted from the episodes in which they were originally learned. This distinction is not merely taxonomic; it reflects different neural substrates, different encoding processes, and different retrieval mechanisms.

The Atkinson-Shiffrin model (1968) introduced the three-store architecture that remains influential: sensory memory (briefly holding raw perceptual input), short-term memory (actively maintained information with limited capacity), and long-term memory (persistent storage with theoretically unlimited capacity). The model's key insight is that information flows between stores through controlled processes -- attention transfers sensory data to short-term memory, and rehearsal consolidates short-term into long-term storage. These controlled processes are precisely what current AI agents lack.

Baddeley's working memory model (1974, revised 2000) refined the short-term store into a multi-component system: the phonological loop, the visuospatial sketchpad, the central executive, and the episodic buffer. The central executive is particularly relevant to agent architectures because it performs the attentional control that determines which information is maintained, manipulated, and ultimately encoded into long-term storage. Without an analogous mechanism, an AI agent cannot prioritize information, cannot selectively attend to task-relevant features, and cannot integrate information across modalities and time steps.

1.2 Why Current LLMs Are Not Agents

Russell and Norvig (2021) define a rational agent as one that selects actions to maximize expected utility given its percept sequence -- the complete history of everything it has perceived. This definition immediately reveals the inadequacy of stateless language models. A model that processes each prompt independently, with no access to prior interactions, prior task outcomes, or accumulated knowledge, cannot maintain a percept sequence. It operates on a single percept, not a sequence. It is, in the formal sense, not an agent at all.

Consider a GPT-4 or Claude instance deployed without any memory infrastructure. Each conversation begins from the same prior distribution over possible worlds. The model has no record of previous failures, no learned preferences from user interactions, no accumulated domain knowledge beyond its training data. It cannot learn from its mistakes because it has no record that mistakes occurred. It cannot adapt its behavior because it has no history of behavior to adapt from.

The information loss in a memoryless agent can be formalized. Let I_0 represent the total information generated across all interactions. Let W represent the context window size and H represent the total interaction history. The information available to the agent at any given step is:

I(n) = I_0 * (W / H)

As the interaction history H grows, the fraction of available information approaches zero. After 1000 interactions of 4000 tokens each, the total history H = 4,000,000 tokens. With a 128K context window, the agent retains at most 3.2% of its interaction history -- and this assumes perfect packing, no system prompts, and no overhead. In practice, the retention fraction is far lower.

Figure 1: Information Retention Decay in Memoryless Agents

Available Information (%)
100|*
   | *
 80|  *
   |   **
 60|     **
   |       ***
 40|          ****
   |              ******
 20|                    **********
   |                              ******************
  0|____________________________________________
   0    200    400    600    800    1000
              Interaction Count

   Curve: I(n) = W / (W + n * avg_tokens_per_interaction) * 100
   W = 128,000 tokens, avg = 4,000 tokens/interaction

1.3 Empirical Evidence for Memory-Augmented Performance

The case for memory is not merely theoretical. Wayne et al. (2018) at DeepMind demonstrated that agents augmented with differentiable neural memory (the MERLIN architecture) achieved a 34% improvement in multi-step task completion compared to memoryless baselines on tasks requiring information persistence across time steps. The Relational Memory Core (RMC) extended this by allowing the agent to perform relational reasoning over stored memories, enabling generalization across tasks with shared structural properties.

Park et al. (2023) demonstrated in their "Generative Agents" work that LLM-based agents equipped with a memory stream (a timestamped log of observations), a retrieval mechanism (recency, relevance, and importance weighting), and a reflection process (periodic synthesis of higher-level insights from memories) produced remarkably coherent, goal-directed behavior over extended simulations. Agents without these memory systems quickly degenerated into repetitive, contextually inappropriate behavior.

Shinn et al. (2023) showed with Reflexion that agents capable of storing and retrieving self-generated feedback (episodic memory of their own reasoning failures) improved by 14-20% on code generation benchmarks over three iterations. The memory did not need to be sophisticated -- a simple text log of previous attempts and their outcomes was sufficient to drive meaningful improvement.

These results converge on a clear conclusion: memory is not an optional enhancement for autonomous agents. It is a prerequisite for agency itself. The remainder of this whitepaper details the architecture required to provide it.


2. Taxonomy of Agent Memory

2.1 The Four Memory Systems

Drawing from cognitive science and adapted for computational implementation, we define four distinct memory subsystems for autonomous agents. Each serves a different functional role, operates on different timescales, and requires different storage and retrieval infrastructure.

Table 1: Agent Memory System Taxonomy

Memory TypeCognitive EquivalentContentEncodingRetrievalPersistenceStorage
EpisodicTulving's episodic memoryTimestamped events: (t, context, action, outcome, metadata)Automatic on agent actionTemporal + similarity-basedPermanent (with decay weighting)Event store + vector index
SemanticTulving's semantic memoryKnowledge graphs, domain models, entity relationshipsConsolidation from episodesGraph traversal + embedding searchPermanentKnowledge graph + vector DB
ProceduralImplicit/procedural memoryLearned action sequences, skill patterns, heuristicsExtraction from repeated successPattern matching on task contextPermanent (updated on new evidence)Structured schema + embeddings
WorkingBaddeley's working memoryActive task context, intermediate results, attention stateExplicit maintenance by executiveDirect access (no search)Transient (task duration)In-memory cache (Redis)

2.2 Episodic Memory: The Experience Stream

Episodic memory is the foundational layer. Every interaction, observation, action, and outcome is recorded as a timestamped event with rich contextual metadata. The schema for an episodic memory record is:

interface EpisodicMemory { id: string; // UUID v7 (time-ordered) timestamp: number; // Unix epoch ms agent_id: string; // Agent that created the memory session_id: string; // Interaction session event_type: 'observation' | 'action' | 'outcome' | 'reflection'; content: string; // Natural language description embedding: Float32Array; // Vector embedding (1536d or 3072d) context: { task_id: string; // Parent task environment: Record<string, any>; // Environmental state participants: string[]; // Other agents/users involved emotional_valence: number; // -1.0 to 1.0 (success/failure signal) }; importance: number; // 0.0 to 1.0 (computed or assigned) access_count: number; // Retrieval frequency last_accessed: number; // For recency weighting decay_factor: number; // Current memory strength }

The importance score determines how aggressively the memory resists decay and how strongly it is weighted during retrieval. Importance can be computed automatically using an LLM-based scoring function or assigned based on outcome signals (task success/failure, user feedback, anomaly detection).

2.3 Semantic Memory: The Knowledge Graph

Semantic memory abstracts away the temporal and contextual specifics of episodes to store generalized knowledge. Where episodic memory records "On January 15, I deployed service X to staging and it failed due to a missing environment variable," semantic memory extracts "Service X requires environment variable Y for deployment" and "Missing environment variables cause deployment failures."

The semantic memory store is best modeled as a property graph with embedded nodes:

interface SemanticNode { id: string; entity_type: string; // 'concept' | 'entity' | 'rule' | 'fact' label: string; // Human-readable name description: string; // Detailed description embedding: Float32Array; // For similarity search properties: Record<string, any>; // Domain-specific attributes confidence: number; // 0.0 to 1.0 source_episodes: string[]; // Episodic memories that contributed created_at: number; updated_at: number; } interface SemanticEdge { id: string; source: string; // Node ID target: string; // Node ID relation_type: string; // 'requires' | 'causes' | 'is_a' | 'part_of' | ... weight: number; // Relationship strength evidence_count: number; // Number of supporting episodes }

2.4 Procedural Memory: Learned Skills

Procedural memory captures learned action sequences that have proven effective. Unlike episodic memory (which records what happened) or semantic memory (which records what is known), procedural memory records how to do things. It is the agent's skill library.

A procedural memory is extracted when the agent detects a repeated pattern of successful actions across multiple episodes:

interface ProceduralMemory { id: string; skill_name: string; // e.g., "deploy_service_to_staging" description: string; // What this skill accomplishes preconditions: Condition[]; // When this skill is applicable action_sequence: ActionStep[]; // Ordered steps postconditions: Condition[]; // Expected outcomes success_rate: number; // Historical success rate execution_count: number; // Times this skill has been applied source_episodes: string[]; // Episodes from which this was extracted parameters: ParameterSchema[]; // Configurable inputs embedding: Float32Array; // For retrieval by task description last_updated: number; version: number; }

2.5 Working Memory: The Active Workspace

Working memory is fundamentally different from the other three systems. It is not persistent -- it exists only for the duration of a task or reasoning session. It is the agent's scratchpad, holding the current goal, intermediate results, retrieved memories from other stores, and the agent's current plan.

Working memory has a fixed capacity, analogous to the ~7 +/- 2 items in human working memory (Miller, 1956). For an AI agent, this capacity is defined by the context window budget allocated to working memory content:

WM_capacity = context_window - system_prompt - tools - safety_margin

For a 128K context window with a 4K system prompt, 8K of tool definitions, and a 16K safety margin, working memory capacity is approximately 100K tokens. This must hold the current task description, relevant retrieved memories, intermediate reasoning, and any environmental observations.

2.6 Memory Data Flow Pipeline

The flow of information through the memory system follows a well-defined pipeline:

Figure 2: Memory Read/Write Pipeline

                    WRITE PATH
                    =========

[Agent Action/Observation]
         |
         v
  +------------------+
  |  Working Memory   |  <-- Immediate context, active reasoning
  |  (Redis, ~100K)   |
  +------------------+
         |
         | (automatic logging)
         v
  +------------------+
  |  Episodic Store   |  <-- Raw event log, append-only
  |  (PostgreSQL +    |
  |   Vector Index)   |
  +------------------+
         |
         | (consolidation, periodic)
         v
  +---------------------+     +----------------------+
  |  Semantic Memory     |     |  Procedural Memory    |
  |  (Knowledge Graph +  |     |  (Skill Library +     |
  |   Qdrant Vectors)    |     |   Action Sequences)   |
  +---------------------+     +----------------------+


                    READ PATH
                    =========

[Task/Query from Agent Executive]
         |
         v
  +------------------+
  |  Working Memory   |  <-- Check active context first
  |  (cache hit?)     |
  +------------------+
         |  (cache miss)
         v
  +------------------+      +------------------+      +------------------+
  |  Episodic Search  | <-> |  Semantic Search  | <-> |  Procedural Match |
  |  (recent events,  |     |  (knowledge,      |     |  (applicable      |
  |   similar context) |     |   domain facts)   |     |   skills)         |
  +------------------+      +------------------+      +------------------+
         |                          |                          |
         +----------+---------------+----------+---------------+
                    |                          |
                    v                          v
             +------------------+     +------------------+
             |  Rank & Filter   |     |  Token Budget     |
             |  (relevance,     |     |  Optimization     |
             |   recency,       |     |  (fit to context  |
             |   importance)    |     |   window)         |
             +------------------+     +------------------+
                    |
                    v
             [Recalled Items + Confidence Scores]

The retrieval function can be formalized as:

M: (query, context, budget) -> {(item_i, confidence_i) | i = 1..k}

where:
  query     = current task description or reasoning state
  context   = environmental state + working memory contents
  budget    = maximum tokens allocated to recalled items
  item_i    = a memory record from any of the three persistent stores
  confidence_i = P(item_i is relevant | query, context)

The retrieval process combines multiple signals into a composite relevance score:

relevance(m, q) = alpha * sim(embed(m), embed(q))     # semantic similarity
               + beta  * recency(m.timestamp)          # temporal recency
               + gamma * m.importance                  # importance weight
               + delta * m.access_count / max_access   # access frequency

where alpha + beta + gamma + delta = 1.0

Typical parameter values from empirical tuning: alpha = 0.5, beta = 0.2, gamma = 0.2, delta = 0.1. The dominance of similarity search reflects the finding that semantic relevance is the strongest predictor of utility, but recency and importance provide critical disambiguation when multiple memories are semantically similar.


3. Vector Search and Embedding Memory

3.1 The Embedding Foundation

The core enabling technology for semantic memory retrieval is dense vector embeddings. These map textual memories into a high-dimensional space where geometric proximity correlates with semantic similarity. The choice of embedding model directly determines retrieval quality.

Table 2: Embedding Model Comparison for Agent Memory

ModelDimensionsMax TokensMTEB ScoreLatency (p50)Cost per 1M tokensBest For
text-embedding-3-large (OpenAI)3072819164.645ms$0.13General-purpose, high recall
text-embedding-3-small (OpenAI)1536819162.325ms$0.02Cost-sensitive, high volume
embed-v4.0 (Cohere)102451264.235ms$0.10Multilingual, search-optimized
BGE-M3 (BAAI)1024819263.515ms*Free (self-hosted)Privacy-sensitive, on-premises
nomic-embed-text-v1.5768819262.410ms*Free (self-hosted)Low-resource, fast inference
mxbai-embed-large (Mixedbread)102451264.712ms*Free (self-hosted)High quality, self-hosted

*Self-hosted latencies measured on RTX 4090 with batch size 1.

The embedding process transforms a memory record into a vector:

embed: text -> R^d

where d is the embedding dimension (e.g., 1536, 3072)

For agent memory, the text input is not the raw memory content alone but a structured representation that includes contextual metadata:

embed_input(m) = f"Task: {m.context.task_id}. "
              + f"Action: {m.event_type}. "
              + f"Content: {m.content}. "
              + f"Outcome: {m.context.emotional_valence > 0 ? 'success' : 'failure'}"

This structured input ensures that the embedding captures not just the semantic content but the task context and outcome, enabling retrieval of memories that are relevant to the agent's current situation.

3.2 Vector Database Architecture

Vector databases provide the storage and retrieval infrastructure for embedding-based memory. The key operation is approximate nearest neighbor (ANN) search, which finds the k vectors most similar to a query vector in sublinear time.

The similarity metric used is cosine similarity:

cos(theta) = (A . B) / (||A|| * ||B||)

where:
  A, B are vectors in R^d
  A . B = sum(a_i * b_i) for i = 1..d
  ||A|| = sqrt(sum(a_i^2))

Cosine similarity ranges from -1 (opposite) to 1 (identical), with 0 indicating orthogonality. For normalized vectors (which most embedding models produce), cosine similarity is equivalent to the dot product, enabling further computational optimization.

The dominant ANN algorithm is Hierarchical Navigable Small World (HNSW), which achieves:

Search complexity: O(log n) average case
Build complexity:  O(n * log n)
Space complexity:  O(n * d + n * M * L)

where:
  n = number of vectors
  d = dimensionality
  M = max connections per node (typically 16-64)
  L = number of layers (typically log(n))

Table 3: Vector Database Comparison for Agent Memory

FeatureQdrantPineconeWeaviateMilvusChromaDB
DeploymentSelf-hosted / CloudCloud onlySelf-hosted / CloudSelf-hosted / CloudSelf-hosted
Max VectorsBillionsBillionsBillionsBillionsMillions
FilteringPayload filteringMetadata filteringWhere filteringExpression filteringWhere filtering
QuantizationScalar, Product, BinaryAutomaticPQ, BQIVF, PQ, HNSWNone
Multi-tenancyCollection-levelNamespaceTenant-levelPartitionCollection
ConsistencyStrongEventualStrongStrongStrong
Latency (p99, 1M vecs)8ms15ms12ms10ms25ms
Production readinessHighHighHighHighLow

For agent memory workloads, Qdrant provides the best balance of performance, filtering capability (critical for constraining retrieval to specific agents, tasks, or time ranges), and self-hosted deployment (required for privacy-sensitive applications).

3.3 Retrieval Quality Metrics

The standard metric for memory retrieval quality is Recall@k: the fraction of truly relevant memories that appear in the top-k retrieved results.

Recall@k = |{relevant} intersection {retrieved_top_k}| / |{relevant}|

For agent memory systems, we additionally track:

  • Precision@k: The fraction of retrieved memories that are actually relevant.
  • Mean Reciprocal Rank (MRR): The average of 1/rank for the first relevant result across queries.
  • Normalized Discounted Cumulative Gain (nDCG@k): Accounts for graded relevance, not just binary.

Empirical benchmarks on our agent memory corpus (250K episodic memories, 50K semantic nodes) show:

Retrieval Quality (text-embedding-3-large, 3072d, Qdrant HNSW):
  Recall@5  = 0.78
  Recall@10 = 0.89
  Recall@20 = 0.95
  MRR       = 0.72
  nDCG@10   = 0.81

With metadata filtering (agent_id + task_type):
  Recall@5  = 0.87 (+9%)
  Recall@10 = 0.94 (+5%)
  MRR       = 0.83 (+11%)

Metadata filtering substantially improves retrieval quality by narrowing the search space to contextually appropriate memories.

3.4 RAG Pipeline with Token Budget Optimization

Retrieval-Augmented Generation (RAG) is the mechanism by which retrieved memories are injected into the agent's context window. The challenge is fitting the most useful memories within a fixed token budget.

The token budget optimization problem can be formulated as a variant of the 0/1 knapsack problem:

maximize: sum(relevance_i * x_i) for i = 1..n
subject to: sum(tokens_i * x_i) <= budget
            x_i in {0, 1}

where:
  relevance_i = composite relevance score for memory i
  tokens_i    = token count of memory i
  budget      = allocated token budget for memory injection
  x_i         = binary selection variable

In practice, a greedy approximation (selecting memories in descending order of relevance/token ratio) achieves near-optimal results:

efficiency_i = relevance_i / tokens_i
sort memories by efficiency_i descending
select until budget exhausted

Figure 3: RAG Pipeline for Agent Memory Retrieval

+-------------------+     +-------------------+     +-------------------+
|  Agent Query      |     |  Embed Query      |     |  Vector Search    |
|  "Deploy service  | --> |  q = embed(query)  | --> |  ANN(q, k=20)    |
|   X to staging"   |     |  d=3072           |     |  + metadata filter |
+-------------------+     +-------------------+     +-------------------+
                                                            |
                                                            v
+-------------------+     +-------------------+     +-------------------+
|  Inject into      |     |  Token Budget     |     |  Re-rank          |
|  Context Window   | <-- |  Optimization     | <-- |  (cross-encoder   |
|  (system prompt   |     |  (knapsack,       |     |   or LLM-based)  |
|   + memories)     |     |   budget=8K)      |     |                   |
+-------------------+     +-------------------+     +-------------------+
         |
         v
+-------------------+
|  Agent Reasoning  |
|  with augmented   |
|  context          |
+-------------------+

The re-ranking step is critical for production quality. Initial vector search provides high recall but imperfect precision. A cross-encoder model (e.g., ms-marco-MiniLM-L-12-v2) or LLM-based reranker scores each candidate memory against the query with full attention, producing a more accurate relevance ordering. This two-stage approach (fast retrieval then precise reranking) achieves both speed and quality.


4. Event Sourcing for Agent State

4.1 The Event Sourcing Pattern

Event sourcing is a persistence pattern in which state changes are stored as an immutable, append-only sequence of events rather than as mutable records. This pattern is natural for agent memory because it preserves the complete history of agent behavior, enables temporal queries ("what did the agent know at time T?"), and supports replay for debugging and analysis.

The core principle: the current state of any entity is derived by replaying its event history from the beginning (or from the most recent snapshot).

interface AgentEvent { event_id: string; // UUID v7 (time-ordered) agent_id: string; // Agent that generated the event stream_id: string; // Aggregate/entity identifier event_type: string; // e.g., 'TaskStarted', 'MemoryStored', 'SkillLearned' version: number; // Monotonically increasing per stream timestamp: number; // Unix epoch ms payload: Record<string, any>; // Event-specific data metadata: { correlation_id: string; // Links related events causation_id: string; // Event that caused this event user_id?: string; // Human initiator, if any }; }

4.2 CQRS: Separating Reads and Writes

Command Query Responsibility Segregation (CQRS) separates the write model (event store) from the read model (query-optimized projections). This separation is essential for agent memory because the write path must be fast and reliable (never lose an event), while the read path requires complex queries across multiple dimensions (time, agent, task, content).

Figure 4: CQRS Architecture for Agent Memory

                        COMMAND SIDE (Write)
                        ====================

[Agent]  -->  [Command Handler]  -->  [Event Store (PostgreSQL)]
                   |                         |
                   | validate                | append event
                   | & process              | (immutable)
                   v                         v
          [Domain Logic]            [Event Published to Bus]
                                            |
                    +-----------+-----------+-----------+
                    |           |           |           |
                    v           v           v           v
              [Episodic    [Semantic   [Procedural  [Analytics
               Projection]  Projection] Projection]  Projection]
              (Qdrant)     (Neo4j/     (PostgreSQL) (ClickHouse)
                           Qdrant)

                        QUERY SIDE (Read)
                        =================

[Agent]  -->  [Query Handler]  -->  [Read Model (projection)]
                                         |
                                   [Optimized for specific
                                    query patterns]

4.3 Snapshot Strategy

Replaying the entire event history to reconstruct current state is computationally expensive: O(n) where n is the total number of events. Snapshots reduce this cost by periodically capturing the current state, so that reconstruction only requires replaying events since the last snapshot.

With snapshots every k events:

Reconstruction cost = O(n mod k)   (events since last snapshot)
Storage overhead    = O(n/k)       (number of snapshots)

Optimal k minimizes: reconstruction_cost + snapshot_storage_cost
Typically k = 100 to 1000 for agent workloads

The snapshot decision can be automated:

interface SnapshotPolicy { event_count_threshold: number; // Snapshot every N events (e.g., 500) time_threshold_ms: number; // Snapshot every T milliseconds (e.g., 3600000) size_threshold_bytes: number; // Snapshot when state exceeds S bytes strategy: 'count' | 'time' | 'size' | 'adaptive'; }

4.4 Storage Growth Model

Event sourcing has a linear storage growth characteristic:

S(t) = S_0 + sum(event_size(i)) for i = 1..n(t)

where:
  S_0   = initial storage overhead (schema, indexes)
  n(t)  = number of events at time t
  event_size(i) = bytes for event i (typically 500-5000 bytes)

For an agent generating 1000 events/day at avg 2KB each:
  Daily growth  = 2 MB
  Monthly growth = 60 MB
  Annual growth  = 730 MB

With vector embeddings (3072d, float32 = 12KB each):
  Daily growth  = 14 MB (events + embeddings)
  Monthly growth = 420 MB
  Annual growth  = 5 GB

This growth rate is entirely manageable for modern infrastructure, but archival and tiered storage strategies become important at multi-agent scale (100+ agents, each generating 1000+ events/day).


5. Kubernetes-Native Memory Infrastructure

5.1 Architecture Overview

Production agent memory systems require a multi-tier storage architecture deployed on Kubernetes for scalability, resilience, and operational manageability. The architecture comprises three storage tiers:

  1. Hot tier (Redis): Working memory, sub-millisecond access, volatile
  2. Warm tier (Qdrant): Vector search, millisecond access, persistent
  3. Cold tier (PostgreSQL): Event store, relational queries, durable

Figure 5: Kubernetes Memory Infrastructure Architecture

+------------------------------------------------------------------------+
|                         Kubernetes Cluster                              |
|                                                                         |
|  +---------------------------+    +---------------------------+         |
|  |   Agent Pod               |    |   Agent Pod               |         |
|  |  +---------------------+  |    |  +---------------------+  |         |
|  |  | Agent Container     |  |    |  | Agent Container     |  |         |
|  |  | (Node.js/Python)    |  |    |  | (Node.js/Python)    |  |         |
|  |  +---------------------+  |    |  +---------------------+  |         |
|  |  +---------------------+  |    |  +---------------------+  |         |
|  |  | Redis Sidecar       |  |    |  | Redis Sidecar       |  |         |
|  |  | (Working Memory)    |  |    |  | (Working Memory)    |  |         |
|  |  | 256MB limit         |  |    |  | 256MB limit         |  |         |
|  |  +---------------------+  |    |  +---------------------+  |         |
|  +---------------------------+    +---------------------------+         |
|              |                              |                           |
|              v                              v                           |
|  +-----------------------------------------------------------+         |
|  |               Internal Service Mesh (ClusterIP)            |         |
|  +-----------------------------------------------------------+         |
|              |                    |                  |                   |
|              v                    v                  v                   |
|  +-----------------+  +------------------+  +------------------+        |
|  | Qdrant          |  | PostgreSQL       |  | Redis Cluster    |        |
|  | StatefulSet     |  | StatefulSet      |  | (Shared State)   |        |
|  | (3 replicas)    |  | (Primary +       |  | (3 replicas)     |        |
|  |                 |  |  2 replicas)     |  |                  |        |
|  | PVC: 50Gi each  |  | PVC: 100Gi      |  | PVC: 10Gi each   |        |
|  | RAM: 4Gi each   |  | RAM: 2Gi        |  | RAM: 1Gi each    |        |
|  +-----------------+  +------------------+  +------------------+        |
|                                                                         |
+------------------------------------------------------------------------+

5.2 Qdrant StatefulSet Configuration

Qdrant requires persistent storage and stable network identities, making StatefulSet the appropriate Kubernetes workload type.

# qdrant-statefulset.yaml apiVersion: apps/v1 kind: StatefulSet metadata: name: qdrant namespace: agent-memory labels: app: qdrant tier: warm-storage spec: serviceName: qdrant-headless replicas: 3 selector: matchLabels: app: qdrant template: metadata: labels: app: qdrant spec: containers: - name: qdrant image: qdrant/qdrant:v1.12.4 ports: - containerPort: 6333 name: http - containerPort: 6334 name: grpc resources: requests: memory: "2Gi" cpu: "1000m" limits: memory: "4Gi" cpu: "2000m" env: - name: QDRANT__CLUSTER__ENABLED value: "true" - name: QDRANT__CLUSTER__P2P__PORT value: "6335" - name: QDRANT__STORAGE__OPTIMIZERS__MEMMAP_THRESHOLD_KB value: "20480" - name: QDRANT__STORAGE__HNSW_INDEX__M value: "32" - name: QDRANT__STORAGE__HNSW_INDEX__EF_CONSTRUCT value: "256" volumeMounts: - name: qdrant-storage mountPath: /qdrant/storage readinessProbe: httpGet: path: /readyz port: 6333 initialDelaySeconds: 5 periodSeconds: 10 livenessProbe: httpGet: path: /healthz port: 6333 initialDelaySeconds: 15 periodSeconds: 20 volumeClaimTemplates: - metadata: name: qdrant-storage spec: accessModes: ["ReadWriteOnce"] storageClassName: fast-ssd resources: requests: storage: 50Gi --- apiVersion: v1 kind: Service metadata: name: qdrant-headless namespace: agent-memory spec: clusterIP: None selector: app: qdrant ports: - port: 6333 name: http - port: 6334 name: grpc - port: 6335 name: p2p

5.3 Redis Sidecar for Working Memory

Each agent pod includes a Redis sidecar for local working memory. This provides sub-millisecond access to active task context without network round-trips to a shared store.

# agent-pod-with-redis-sidecar.yaml apiVersion: apps/v1 kind: Deployment metadata: name: agent-worker namespace: agent-memory spec: replicas: 5 selector: matchLabels: app: agent-worker template: metadata: labels: app: agent-worker spec: containers: - name: agent image: blueflyio/agent-worker:latest ports: - containerPort: 8080 env: - name: REDIS_URL value: "redis://localhost:6379" - name: QDRANT_URL value: "http://qdrant-headless.agent-memory.svc:6333" - name: POSTGRES_URL valueFrom: secretKeyRef: name: postgres-credentials key: connection-string - name: WORKING_MEMORY_TTL_SECONDS value: "3600" - name: WORKING_MEMORY_MAX_ITEMS value: "100" resources: requests: memory: "512Mi" cpu: "500m" limits: memory: "1Gi" cpu: "1000m" - name: redis-sidecar image: redis:7-alpine ports: - containerPort: 6379 args: - redis-server - --maxmemory - "256mb" - --maxmemory-policy - allkeys-lru - --save - "" - --appendonly - "no" resources: requests: memory: "128Mi" cpu: "100m" limits: memory: "256Mi" cpu: "200m"

5.4 PostgreSQL Event Store

The event store requires strong durability guarantees and support for temporal queries.

# postgresql-event-store.yaml apiVersion: apps/v1 kind: StatefulSet metadata: name: postgres-eventstore namespace: agent-memory spec: serviceName: postgres-headless replicas: 3 selector: matchLabels: app: postgres-eventstore template: metadata: labels: app: postgres-eventstore spec: containers: - name: postgres image: postgres:16-alpine ports: - containerPort: 5432 env: - name: POSTGRES_DB value: agent_events - name: POSTGRES_USER valueFrom: secretKeyRef: name: postgres-credentials key: username - name: POSTGRES_PASSWORD valueFrom: secretKeyRef: name: postgres-credentials key: password - name: PGDATA value: /var/lib/postgresql/data/pgdata resources: requests: memory: "1Gi" cpu: "500m" limits: memory: "2Gi" cpu: "1000m" volumeMounts: - name: pg-storage mountPath: /var/lib/postgresql/data volumeClaimTemplates: - metadata: name: pg-storage spec: accessModes: ["ReadWriteOnce"] storageClassName: fast-ssd resources: requests: storage: 100Gi

5.5 Resource Calculations

Accurate resource planning requires understanding the relationship between data volume and infrastructure requirements.

Vector Storage (Qdrant):

Memory per vector = dimensions * 4 bytes (float32) + overhead

For text-embedding-3-large (3072d):
  Per vector = 3072 * 4 = 12,288 bytes = 12 KB
  + HNSW index overhead ~= 2 KB per vector (M=32)
  + Payload overhead ~= 1 KB per vector (metadata)
  Total per vector ~= 15 KB

For 1 million vectors:
  Raw vectors = 1M * 12 KB = 12 GB
  With index  = 1M * 15 KB = 15 GB
  Recommended RAM = 1.5x index = 22.5 GB

For 1M vectors @ 1536d (text-embedding-3-small):
  Per vector = 1536 * 4 = 6,144 bytes = 6 KB
  Total per vector ~= 9 KB
  1M vectors = 9 GB storage, ~13.5 GB recommended RAM

Approximate resource requirements by scale:

ScaleVectorsQdrant RAMQdrant DiskPostgreSQL DiskRedis RAM
Small (1 agent)100K1.5 GB5 GB10 GB256 MB
Medium (10 agents)1M15 GB50 GB100 GB1 GB
Large (100 agents)10M150 GB500 GB1 TB5 GB
Enterprise (1000 agents)100MShardedShardedShardedClustered

5.6 Horizontal Pod Autoscaler

# qdrant-hpa.yaml apiVersion: autoscaling/v2 kind: HorizontalPodAutoscaler metadata: name: qdrant-hpa namespace: agent-memory spec: scaleTargetRef: apiVersion: apps/v1 kind: StatefulSet name: qdrant minReplicas: 3 maxReplicas: 9 metrics: - type: Resource resource: name: memory target: type: Utilization averageUtilization: 75 - type: Resource resource: name: cpu target: type: Utilization averageUtilization: 70 - type: Pods pods: metric: name: qdrant_search_latency_p99 target: type: AverageValue averageValue: "20m" behavior: scaleUp: stabilizationWindowSeconds: 300 policies: - type: Pods value: 1 periodSeconds: 300 scaleDown: stabilizationWindowSeconds: 600 policies: - type: Pods value: 1 periodSeconds: 600

6. Memory Consolidation and Learning

6.1 The Consolidation Process

Memory consolidation is the process by which raw episodic memories are transformed into structured semantic knowledge and procedural skills. In biological systems, consolidation occurs primarily during sleep, with the hippocampus replaying episodic traces and the neocortex gradually incorporating them into long-term semantic representations. For AI agents, consolidation is an explicit computational process that can be triggered periodically or on-demand.

The consolidation pipeline has three stages:

  1. Clustering: Group related episodic memories by task type, domain, and outcome.
  2. Abstraction: Extract general principles, rules, and patterns from clusters.
  3. Integration: Merge extracted knowledge into the semantic graph and skill library.

Figure 6: Memory Consolidation Pipeline

+------------------+
| Episodic Store   |
| (raw events)     |
+------------------+
         |
         | periodic trigger (every N events or T hours)
         v
+------------------+     +------------------+
| Cluster Analysis |     | Temporal          |
| (embed + DBSCAN  | --> | Sequence Mining   |
|  or k-means)     |     | (frequent action  |
|                  |     |  patterns)        |
+------------------+     +------------------+
         |                        |
         v                        v
+------------------+     +------------------+
| LLM Abstraction  |     | Skill Extraction  |
| "What general    |     | "What action      |
|  knowledge can   |     |  sequence          |
|  be extracted    |     |  succeeds          |
|  from these      |     |  repeatedly?"     |
|  episodes?"      |     |                   |
+------------------+     +------------------+
         |                        |
         v                        v
+------------------+     +------------------+
| Semantic Memory  |     | Procedural Memory |
| (knowledge graph |     | (skill library)   |
|  update)         |     |                   |
+------------------+     +------------------+

6.2 Episodic to Semantic Conversion

The conversion process uses an LLM to examine clusters of related episodic memories and extract generalizable knowledge. The prompt template:

Given the following episodic memories from agent interactions:

{clustered_episodes}

Extract general knowledge that can be derived from these experiences.
For each piece of knowledge, provide:
1. A concise statement of the knowledge
2. The confidence level (0.0-1.0) based on how consistently this pattern appears
3. The specific episodes that support this conclusion
4. Any exceptions or conditions that limit this knowledge

Format as structured JSON matching the SemanticNode schema.

The key quality metric is whether the extracted knowledge actually improves future agent performance. We measure this with A/B testing: agents with consolidated semantic memory versus agents with only episodic recall, on tasks from the same domain. Empirical results show a 12-18% improvement in task completion time when semantic knowledge is available, primarily because the agent can skip the retrieval and reasoning steps that would otherwise be needed to rediscover the same patterns from raw episodes.

6.3 Procedural Extraction from Success Patterns

Procedural memory extraction identifies action sequences that consistently lead to success. The algorithm:

1. Filter episodic store for events with positive outcomes
   (emotional_valence > threshold)

2. Extract action sequences from successful episodes:
   sequence = [(action_1, context_1), (action_2, context_2), ...]

3. Apply frequent sequential pattern mining (PrefixSpan algorithm):
   patterns = PrefixSpan(sequences, min_support=3)

4. For each frequent pattern:
   a. Compute success_rate = successful_applications / total_applications
   b. If success_rate > 0.7:
      c. Create ProceduralMemory entry
      d. Generalize context conditions (LLM-assisted)
      e. Add to skill library

5. Validate against held-out episodes

6.4 Forgetting Curves and Memory Decay

Not all memories should be retained indefinitely with equal weight. Ebbinghaus (1885) established that memory strength decays exponentially without rehearsal:

S(t) = S_0 * e^(-t / tau)

where:
  S(t)  = memory strength at time t
  S_0   = initial encoding strength
  t     = time since encoding
  tau   = time constant (depends on importance, rehearsal)

For agent memory, the decay function is modulated by importance and access frequency:

decay(m, t) = m.importance * e^(-t / (tau_base * (1 + log(1 + m.access_count))))

where:
  tau_base      = base time constant (e.g., 30 days)
  m.importance  = computed importance score
  m.access_count = number of times memory has been retrieved

Memories that are frequently accessed decay more slowly (the logarithmic rehearsal factor). High-importance memories also decay more slowly. This produces a natural forgetting curve where trivial, unretrieved memories fade while critical, frequently-used memories persist.

The practical implementation applies decay as a weighting factor during retrieval rather than deleting memories:

effective_relevance(m, q, t) = relevance(m, q) * decay(m, t)

Memories with very low decay values (below a threshold, e.g., 0.01) can be archived to cold storage, reducing the active search space while preserving the ability to recover historical information if needed.


7. Multi-Agent Shared Memory

7.1 The Coordination Problem

When multiple agents operate in a shared environment, they need mechanisms to share knowledge, coordinate actions, and avoid redundant work. This requires shared memory systems that maintain consistency without sacrificing the autonomy that makes multi-agent systems valuable.

The fundamental tension is between consistency (all agents see the same state) and availability (agents can operate independently when peers are unavailable). In distributed systems terms, this is the CAP theorem applied to agent memory.

7.2 Shared Knowledge Base Architecture

A shared knowledge base provides a common semantic memory that all agents can read from and contribute to. The architecture uses a layered approach:

Layer 1: Agent-Local Memory (private)
  - Personal episodic memories
  - Agent-specific procedural skills
  - Working memory

Layer 2: Team-Shared Memory (scoped)
  - Shared semantic knowledge for a task group
  - Team-level procedural skills
  - Shared task context

Layer 3: Organization-Wide Memory (global)
  - Global knowledge graph
  - Organizational policies and rules
  - Cross-team learned patterns

Each layer has different consistency requirements. Agent-local memory requires no coordination. Team-shared memory uses eventual consistency with conflict resolution. Organization-wide memory uses strong consistency with write authorization controls.

7.3 CRDTs for Consistency

Conflict-free Replicated Data Types (CRDTs) provide eventual consistency without coordination. For agent memory, the key CRDT types are:

  • G-Counter (Grow-only Counter): For access counts and event counters. Each agent maintains its own counter; the global value is the sum. Merges by taking the maximum of each agent's count.

  • LWW-Register (Last-Writer-Wins Register): For semantic node properties that can be updated independently. Merges by taking the value with the latest timestamp.

  • OR-Set (Observed-Remove Set): For sets of relationships, tags, or references. Supports both add and remove operations with deterministic conflict resolution.

interface CRDTMemoryNode { id: string; content: LWWRegister<string>; // Last-writer-wins for content embedding: LWWRegister<Float32Array>; // Latest embedding importance: GCounter; // Grows as agents access tags: ORSet<string>; // Add/remove tags contributors: GSet<string>; // Grow-only set of contributing agents version_vector: Map<string, number>; // Per-agent version tracking }

7.4 Blackboard Architecture Pattern

The blackboard architecture (Hayes-Roth, 1985) provides a structured approach to multi-agent shared memory. A central blackboard holds the shared problem state. Knowledge sources (agents) read from and write to the blackboard. A control component determines which knowledge source should act next.

Figure 7: Blackboard Architecture for Multi-Agent Memory

+------------------------------------------------------------------------+
|                         BLACKBOARD                                      |
|                                                                         |
|  +------------------+  +------------------+  +------------------+       |
|  |  Goal Layer      |  |  Plan Layer      |  |  Execution Layer |       |
|  |  (what to        |  |  (how to         |  |  (current        |       |
|  |   achieve)       |  |   achieve it)    |  |   progress)      |       |
|  +------------------+  +------------------+  +------------------+       |
|                                                                         |
|  +------------------+  +------------------+  +------------------+       |
|  |  Knowledge Layer |  |  Hypothesis      |  |  Evidence Layer  |       |
|  |  (shared facts   |  |  Layer (proposed |  |  (observations,  |       |
|  |   and rules)     |  |   explanations)  |  |   measurements)  |       |
|  +------------------+  +------------------+  +------------------+       |
|                                                                         |
+------------------------------------------------------------------------+
         ^          ^          ^          ^          ^
         |          |          |          |          |
    +--------+ +--------+ +--------+ +--------+ +--------+
    |Agent 1 | |Agent 2 | |Agent 3 | |Agent 4 | |Agent 5 |
    |Planner | |Coder   | |Tester  | |Reviewer| |Deployer|
    +--------+ +--------+ +--------+ +--------+ +--------+

    Each agent:
    1. Reads relevant layers
    2. Applies its expertise
    3. Writes results back
    4. Control decides next agent

7.5 Conflict Resolution

When multiple agents attempt to update the same memory concurrently, conflicts must be resolved deterministically:

Resolution Strategy Priority:
1. Evidence-based: Update with more supporting episodes wins
2. Confidence-based: Higher confidence score wins
3. Recency-based: Most recent update wins (LWW)
4. Authority-based: Higher-tier agent's update wins
5. Merge: If updates are complementary, merge both

The resolution strategy is selected based on the memory type:

Memory TypeDefault ResolutionRationale
Semantic factsEvidence-basedMore evidence = more reliable
Procedural skillsConfidence + recencySkills improve over time
Shared task stateRecency (LWW)Current state matters most
Knowledge graph edgesMerge (additive)Relationships accumulate

8. Privacy, Security, and Memory Governance

8.1 The Privacy Challenge

Agent memory systems store rich records of interactions, decisions, and outcomes. This data is inherently sensitive: it may contain personal information from users, proprietary business data, or security-relevant system details. Governance of agent memory requires controls at every layer of the architecture.

8.2 Access Control Model

Memory access is governed by a role-based access control (RBAC) model with four dimensions:

  1. Agent identity: Which agent is requesting access?
  2. Memory scope: Private, team, or organization-wide?
  3. Operation type: Read, write, update, delete?
  4. Content classification: Public, internal, confidential, restricted?
interface MemoryAccessPolicy { agent_id: string; allowed_scopes: ('private' | 'team' | 'organization')[]; allowed_operations: ('read' | 'write' | 'update' | 'delete')[]; content_classifications: ('public' | 'internal' | 'confidential' | 'restricted')[]; time_restrictions?: { retention_days: number; // Auto-delete after N days access_hours?: string; // Cron-style access window }; audit_level: 'none' | 'access' | 'content'; // Logging granularity }

8.3 PII Detection and Redaction

Before storing episodic memories, a PII detection pipeline identifies and redacts personally identifiable information. The pipeline uses both pattern matching (for structured PII like emails, phone numbers, SSNs) and NER models (for unstructured PII like names, addresses).

The redaction process replaces PII with typed tokens:

Input:  "John Smith called from 555-123-4567 about account #12345"
Output: "{{PERSON_1}} called from {{PHONE_1}} about account {{ACCOUNT_1}}"

Mapping stored separately (encrypted):
  PERSON_1  -> "John Smith"
  PHONE_1   -> "555-123-4567"
  ACCOUNT_1 -> "12345"

The mapping is stored in a separate, encrypted data store with stricter access controls than the memory store itself. This separation ensures that even if the memory store is compromised, PII is not directly exposed.

8.4 GDPR Right to Erasure

The General Data Protection Regulation (GDPR) establishes the right to erasure (Article 17): individuals can request the deletion of their personal data. For agent memory systems, this requires the ability to:

  1. Identify all memories associated with a specific individual
  2. Delete those memories from all stores (episodic, semantic, procedural)
  3. Propagate deletion to derived knowledge (if derived solely from that individual's data)
  4. Verify deletion completeness

Event sourcing complicates erasure because the event store is append-only. The solution is crypto-shredding: each individual's PII is encrypted with a unique key. Erasure is accomplished by destroying the encryption key, rendering the PII unrecoverable even though the encrypted data remains in the event store.

Storage:  [Event] -> [Encrypted PII] -> stored with key_id reference
Erasure:  DELETE FROM encryption_keys WHERE individual_id = ?
Result:   PII becomes irrecoverable; event structure preserved for audit trail

8.5 Memory Audit Trail

All memory operations are logged to an immutable audit trail:

interface MemoryAuditEntry { timestamp: number; agent_id: string; operation: 'read' | 'write' | 'update' | 'delete' | 'search'; memory_type: 'episodic' | 'semantic' | 'procedural' | 'working'; memory_ids: string[]; query?: string; // For search operations result_count?: number; access_justification: string; // Why the agent needed this memory policy_evaluation: { allowed: boolean; policy_id: string; denied_reason?: string; }; }

9. Benchmarks and Performance Analysis

9.1 Latency Benchmarks

Latency is the critical performance metric for agent memory because it directly impacts the agent's response time and throughput. We benchmark each storage tier under realistic workloads.

Table 4: Latency Benchmarks by Storage Tier

OperationRedis (Working)Qdrant (Vector)PostgreSQL (Event)
Single key read0.2msN/A2ms
Single key write0.3msN/A3ms
Vector search (k=10, 100K vectors)N/A5msN/A
Vector search (k=10, 1M vectors)N/A12msN/A
Vector search (k=10, 10M vectors)N/A28msN/A
Vector search + filter (1M vectors)N/A15msN/A
Event appendN/AN/A4ms
Event query (time range, 1M events)N/AN/A25ms
Event query (aggregate, 1M events)N/AN/A45ms
Snapshot readN/AN/A8ms
Full memory retrieval pipeline (end-to-end)N/AN/AN/A
-- Cache hit0.5ms--
-- Cache miss, vector search-15ms-
-- Cache miss, vector + event enrichment-15ms25ms
-- Total (typical)--35-50ms

All latencies measured at p50 on:

  • 3-node Qdrant cluster (4 vCPU, 16GB RAM each)
  • 3-node PostgreSQL (2 vCPU, 8GB RAM, primary + 2 replicas)
  • Redis 7 (2 vCPU, 4GB RAM, single instance per agent)
  • Network: Kubernetes pod-to-pod, same availability zone

9.2 Throughput Benchmarks

OperationThroughput (ops/sec)Configuration
Redis reads150,000Single instance, pipelining
Redis writes120,000Single instance, pipelining
Qdrant vector search8003 replicas, 1M vectors, k=10
Qdrant vector upsert5,000Batch size 100
PostgreSQL event insert15,000Batch size 100, async commit
PostgreSQL event query2,000Time-range queries
Embedding generation (OpenAI)3,000text-embedding-3-small, batch
Embedding generation (self-hosted BGE-M3)500RTX 4090, batch size 32

9.3 Cost Analysis

Table 5: Cost Per Million Memories by Deployment Model

ComponentSelf-Hosted (K8s)Managed CloudHybrid
Embedding generation$0.02 (self-hosted)$0.13 (OpenAI large)$0.02
Vector storage (Qdrant)$0.15/month (3-node)$0.45/month (Pinecone)$0.15
Event storage (PostgreSQL)$0.08/month$0.25/month (RDS)$0.08
Working memory (Redis)$0.03/month$0.10/month (ElastiCache)$0.03
Network/transfer$0.01/month$0.05/month$0.02
Total per 1M memories/month$0.29$0.98$0.30

Cost per memory operation:

Write (embed + store): $0.000013 (self-hosted) to $0.000130 (cloud)
Read (search + retrieve): $0.000002 (self-hosted) to $0.000008 (cloud)
Consolidation (per episode): $0.001 to $0.003 (LLM cost for abstraction)

9.4 Scalability Characteristics

The system scales along three dimensions:

  1. Vertical: Increasing RAM and CPU per node improves throughput but has diminishing returns beyond 32GB RAM per Qdrant node.

  2. Horizontal: Adding Qdrant replicas increases search throughput linearly. Adding PostgreSQL read replicas increases query throughput. Redis can be clustered for shared state.

  3. Sharding: Beyond 10M vectors per collection, Qdrant supports distributed sharding across nodes. This introduces shard management complexity but enables scaling to billions of vectors.

Scaling equations:

Search throughput = base_throughput * num_replicas * efficiency_factor
  where efficiency_factor ~= 0.85 (overhead for coordination)

Storage capacity = num_shards * per_shard_capacity
  where per_shard_capacity ~= 10M vectors (recommended max)

Write throughput = base_write_throughput * (1 / replication_factor)
  (writes must propagate to all replicas)

10. Future Directions

10.1 Neuromorphic Memory Architectures

Current vector-based memory systems are a crude approximation of biological memory. Emerging neuromorphic computing architectures (Intel Loihi 2, IBM NorthPole) offer hardware-level support for associative memory, content-addressable storage, and spike-timing-dependent plasticity. These architectures could enable agent memory systems that learn and consolidate at hardware speed, eliminating the latency and energy costs of software-based embedding and search.

10.2 Continual Learning Without Catastrophic Forgetting

A persistent challenge in agent learning is catastrophic forgetting: when learning new information overwrites previously learned knowledge. Current approaches (experience replay, elastic weight consolidation, progressive neural networks) address this partially. The memory architecture described in this paper provides an external solution -- by storing knowledge outside the model weights, the agent can learn continuously without risking forgetting. The integration of external memory with in-context learning represents a promising frontier.

10.3 Memory-Augmented Reasoning

Chain-of-thought reasoning and tree-of-thought search can be enhanced by memory-augmented retrieval at each reasoning step. Rather than reasoning purely from the current context, the agent retrieves relevant memories at each step to inform the next. This transforms reasoning from a context-limited process into a knowledge-grounded process.

10.4 Cross-Modal Memory

Current agent memory systems are primarily text-based. Extending memory to include visual observations (screenshots, diagrams), audio (conversations, alerts), and structured data (metrics, logs) requires multi-modal embedding models and cross-modal retrieval. Models like CLIP and ImageBind demonstrate that unified embedding spaces across modalities are achievable.


10. References

  1. Atkinson, R. C., & Shiffrin, R. M. (1968). Human memory: A proposed system and its control processes. In K. W. Spence & J. T. Spence (Eds.), The Psychology of Learning and Motivation (Vol. 2, pp. 89-195). Academic Press. DOI:10.1016/S0079-7421(08)60422-3

  2. Baddeley, A. D. (1974). Working memory. In G. H. Bower (Ed.), The Psychology of Learning and Motivation (Vol. 8, pp. 47-89). Academic Press. DOI:10.1016/S0079-7421(08)60452-1

  3. Baddeley, A. D. (2000). The episodic buffer: a new component of working memory? Trends in Cognitive Sciences, 4(11), 417-423. DOI:10.1016/S1364-6613(00)01538-2

  4. Borgeaud, S., Mensch, A., Hoffmann, J., Cai, T., Rutherford, E., Millican, K., ... & Sifre, L. (2022). Improving language models by retrieving from trillions of tokens. Proceedings of the 39th International Conference on Machine Learning (ICML). arXiv:2112.04426

  5. Ebbinghaus, H. (1885). Uber das Gedachtnis: Untersuchungen zur experimentellen Psychologie. Duncker & Humblot. English Translation (1913)

  6. Graves, A., Wayne, G., & Danihelka, I. (2014). Neural Turing machines. arXiv:1410.5401

  7. Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwinska, A., ... & Hassabis, D. (2016). Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626), 471-476. DOI:10.1038/nature20101

  8. Hayes-Roth, B. (1985). A blackboard architecture for control. Artificial Intelligence, 26(3), 251-321. DOI:10.1016/0004-3702(85)90063-3

  9. Johnson, J., Douze, M., & Jegou, H. (2019). Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3), 535-547. arXiv:1702.08734 | GitHub (FAISS)

  10. Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., ... & Yih, W. T. (2020). Dense passage retrieval for open-domain question answering. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). arXiv:2004.04906

  11. Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., ... & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems (NeurIPS), 33, 9459-9474. arXiv:2005.11401

  12. Malkov, Y. A., & Yashunin, D. A. (2018). Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(4), 824-836. arXiv:1603.09320 | DOI:10.1109/TPAMI.2018.2889473

  13. Miller, G. A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63(2), 81-97. DOI:10.1037/h0043158 | PDF

  14. Park, J. S., O'Brien, J. C., Cai, C. J., Morris, M. R., Liang, P., & Bernstein, M. S. (2023). Generative agents: Interactive simulacra of human behavior. Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST). arXiv:2304.03442 | DOI:10.1145/3586183.3606763

  15. Peng, B., Galley, M., He, P., Cheng, H., Xie, Y., Hu, Y., ... & Gao, J. (2023). Check your facts and try again: Improving large language models with external knowledge and automated feedback. arXiv:2302.12813

  16. Russell, S. J., & Norvig, P. (2021). Artificial Intelligence: A Modern Approach (4th ed.). Pearson. ISBN: 978-0134610993. Publisher

  17. Shapiro, M., Preguica, N., Baquero, C., & Zawirski, M. (2011). Conflict-free replicated data types. Proceedings of the 13th International Conference on Stabilization, Safety, and Security of Distributed Systems. DOI:10.1007/978-3-642-24550-3_29 | HAL

  18. Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems (NeurIPS), 36. arXiv:2303.11366

  19. Sukhbaatar, S., Weston, J., & Fergus, R. (2015). End-to-end memory networks. Advances in Neural Information Processing Systems (NeurIPS), 28. arXiv:1503.08895

  20. Tulving, E. (1972). Episodic and semantic memory. In E. Tulving & W. Donaldson (Eds.), Organization of Memory (pp. 381-403). Academic Press. Semantic Scholar

  21. Tulving, E. (1985). Memory and consciousness. Canadian Psychology, 26(1), 1-12. DOI:10.1037/h0080017

  22. Wang, L., Ma, C., Feng, X., Zhang, Z., Yang, H., Zhang, J., ... & Wen, J. R. (2024). A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6), 186345. arXiv:2308.11432

  23. Wayne, G., Hung, C. C., Amos, D., Mirza, M., Ahuja, A., Grabska-Barwinska, A., ... & Lillicrap, T. (2018). Unsupervised predictive memory in a goal-directed agent. arXiv:1803.10760

  24. Weston, J., Chopra, S., & Bordes, A. (2015). Memory networks. Proceedings of the International Conference on Learning Representations (ICLR). arXiv:1410.3916

  25. Weaviate (2024). Vector database benchmarks. weaviate.io | GitHub

  26. Xi, Z., Chen, W., Guo, X., He, W., Ding, Y., Hong, B., ... & Gui, T. (2023). The rise and potential of large language model based agents: A survey. arXiv:2309.07864

  27. Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing reasoning and acting in language models. Proceedings of the International Conference on Learning Representations (ICLR). arXiv:2210.03629

  28. Zhong, W., Guo, L., Gao, Q., Ye, H., & Wang, Y. (2024). MemoryBank: Enhancing large language models with long-term memory. Proceedings of the AAAI Conference on Artificial Intelligence, 38(17), 19724-19731. arXiv:2305.10250


Appendix A: Glossary

TermDefinition
ANNApproximate Nearest Neighbor: sublinear search for similar vectors
CQRSCommand Query Responsibility Segregation: separate read/write models
CRDTConflict-free Replicated Data Type: eventually consistent distributed data structure
HNSWHierarchical Navigable Small World: graph-based ANN algorithm
LWWLast-Writer-Wins: conflict resolution strategy using timestamps
MRRMean Reciprocal Rank: retrieval quality metric
nDCGNormalized Discounted Cumulative Gain: graded relevance metric
PIIPersonally Identifiable Information
PVCPersistent Volume Claim: Kubernetes storage abstraction
RAGRetrieval-Augmented Generation: injecting retrieved context into LLM prompts
RBACRole-Based Access Control

Appendix B: Collection Configuration for Qdrant

{ "collection_name": "agent_episodic_memory", "vectors": { "size": 3072, "distance": "Cosine", "on_disk": false, "hnsw_config": { "m": 32, "ef_construct": 256, "full_scan_threshold": 10000 }, "quantization_config": { "scalar": { "type": "int8", "quantile": 0.99, "always_ram": true } } }, "optimizers_config": { "memmap_threshold": 20000, "indexing_threshold": 20000, "flush_interval_sec": 5 }, "replication_factor": 2, "write_consistency_factor": 1, "shard_number": 3 }

Appendix C: Event Store Schema (PostgreSQL)

CREATE TABLE agent_events ( event_id UUID PRIMARY KEY DEFAULT gen_random_uuid(), stream_id UUID NOT NULL, agent_id VARCHAR(64) NOT NULL, event_type VARCHAR(128) NOT NULL, version BIGINT NOT NULL, timestamp_ms BIGINT NOT NULL DEFAULT (EXTRACT(EPOCH FROM NOW()) * 1000)::BIGINT, payload JSONB NOT NULL, metadata JSONB NOT NULL DEFAULT '{}', embedding_id UUID, -- Reference to vector in Qdrant CONSTRAINT unique_stream_version UNIQUE (stream_id, version) ); CREATE INDEX idx_events_agent_time ON agent_events (agent_id, timestamp_ms DESC); CREATE INDEX idx_events_stream ON agent_events (stream_id, version ASC); CREATE INDEX idx_events_type ON agent_events (event_type); CREATE INDEX idx_events_payload ON agent_events USING GIN (payload jsonb_path_ops); CREATE TABLE agent_snapshots ( snapshot_id UUID PRIMARY KEY DEFAULT gen_random_uuid(), stream_id UUID NOT NULL, agent_id VARCHAR(64) NOT NULL, version BIGINT NOT NULL, timestamp_ms BIGINT NOT NULL, state JSONB NOT NULL, CONSTRAINT unique_snapshot_version UNIQUE (stream_id, version) ); CREATE INDEX idx_snapshots_stream ON agent_snapshots (stream_id, version DESC);

End of Whitepaper 03 BlueFly.io Agent Platform Series Copyright 2026 BlueFly.io. All rights reserved.

OSSAAgentsResearch