Agent Observability and Distributed Tracing: OpenTelemetry, Decision Auditing, and Operational Intelligence
Whitepaper 08 | Bluefly Agent Platform Series
Version: 1.0 | Date: 2026-02-07 | Authors: Bluefly Platform Engineering | Classification: Technical Reference
Abstract
The emergence of autonomous AI agents operating across distributed infrastructure introduces observability challenges that fundamentally exceed the capabilities of traditional Application Performance Monitoring (APM). While conventional systems measure latency, error rates, and throughput, agent-based architectures demand visibility into decision rationale, tool selection reasoning, autonomy level transitions, memory retrieval quality, and policy compliance. This whitepaper presents a comprehensive observability framework purpose-built for multi-agent systems, grounded in OpenTelemetry instrumentation, Prometheus metrics, structured decision audit trails, and Grafana-based operational intelligence. We address the unique requirements imposed by the EU AI Act's transparency mandates, demonstrate how to instrument agent decision pipelines with custom semantic spans, propose a decision audit schema that supports both regulatory compliance and post-hoc explainability, and provide production-ready configurations for Kubernetes-native observability stacks. The framework introduces a fourth observability pillar -- Decision Audit Trails -- alongside the traditional pillars of logs, metrics, and traces. Through detailed analysis of storage economics, cardinality management, anomaly detection, and cost optimization, we provide engineering teams with a complete blueprint for achieving operational intelligence over fleet-scale agent deployments. All code examples, Helm configurations, and dashboard specifications target TypeScript-based agent platforms running on Kubernetes with OpenTelemetry Collector pipelines.
1. Why Agent Observability Differs from APM
1.1 The Limits of Traditional Monitoring
Traditional Application Performance Monitoring (APM) evolved to answer three fundamental questions about software systems: How fast is it responding (latency)? How often does it fail (error rate)? How much work is it doing (throughput)? These three dimensions -- codified in Google's Service Level Indicators (SLIs) and popularized through the RED method (Rate, Errors, Duration) -- form the bedrock of modern observability practice. Tools such as Datadog, New Relic, and Dynatrace have refined these measurements to extraordinary precision, capable of tracking individual HTTP requests across microservice boundaries, correlating database query performance with user-facing latency, and alerting on statistical deviations from baseline behavior.
However, when applied to autonomous AI agent systems, traditional APM reveals a critical blind spot. Consider a multi-agent system where an orchestrator agent receives a user request, decomposes it into subtasks, delegates those subtasks to specialized worker agents, each of which selects tools, retrieves memories, makes decisions under uncertainty, and communicates intermediate results back through the coordination layer. Traditional APM would tell us that the orchestrator's HTTP endpoint responded in 3,200 milliseconds with a 200 status code. It would show us that downstream service calls completed within their SLA windows. It would confirm that throughput remained within expected bounds.
What traditional APM cannot tell us is far more consequential: Why did the orchestrator choose to decompose the task into three subtasks instead of four? Why did the research agent select a web search tool instead of querying the vector database? Why did the summarization agent's confidence score drop from 0.92 to 0.67 between consecutive invocations? Why did the system autonomously escalate its permission level from Tier 1 to Tier 3 without human approval? These questions represent the fundamental observability gap that agent-native monitoring must address.
1.2 The Four Dimensions of Agent Observability
Agent systems operate across four observability dimensions that collectively exceed the scope of traditional monitoring:
Dimension 1: Operational Telemetry (Traditional APM) This dimension encompasses the familiar metrics of system health -- request latency, error rates, resource utilization, and throughput. For agents, this extends to include LLM API call latency, token consumption rates, tool execution duration, and inter-agent communication overhead. While necessary, operational telemetry alone provides an incomplete picture.
Dimension 2: Cognitive Telemetry (Agent-Specific) Cognitive telemetry captures the reasoning processes that drive agent behavior. This includes the chain-of-thought traces that precede decisions, the assessment criteria applied when selecting between alternative actions, the confidence scores assigned to different options, and the context windows assembled from memory retrieval. Cognitive telemetry answers the question "why did the agent do what it did?" rather than merely "what did the agent do?"
Dimension 3: Behavioral Telemetry (Agent-Specific) Behavioral telemetry tracks patterns in agent behavior over time. This includes autonomy level transitions (when and why an agent requested or was granted elevated permissions), tool usage patterns (which tools are preferred under which circumstances), memory access patterns (how often agents retrieve versus create memories), and communication patterns (how frequently agents coordinate versus operate independently). Behavioral telemetry enables fleet-level analysis and anomaly detection.
Dimension 4: Compliance Telemetry (Regulatory) Compliance telemetry provides the audit trails required by regulatory frameworks such as the EU AI Act, which mandates that high-risk AI systems maintain detailed records of their decision-making processes. This includes immutable decision logs with full context capture, policy assessment records showing which rules were consulted and how they influenced outcomes, and retention-compliant storage that satisfies industry-specific archival requirements.
Table 1: Agent Observability Dimensions vs Traditional APM
+---------------------+----------------------+------------------------------+
| Dimension | Traditional APM | Agent Observability |
+---------------------+----------------------+------------------------------+
| Latency | Request/Response | Decision latency, reasoning |
| | time | time, tool selection time |
+---------------------+----------------------+------------------------------+
| Errors | HTTP status codes, | Decision errors, policy |
| | exceptions | violations, hallucinations, |
| | | confidence drops |
+---------------------+----------------------+------------------------------+
| Throughput | Requests per second | Decisions per minute, tasks |
| | | completed, tokens consumed |
+---------------------+----------------------+------------------------------+
| Decision Rationale | Not captured | Chain-of-thought, alternative|
| | | assessment, context assembly |
+---------------------+----------------------+------------------------------+
| Tool Selection | Not captured | Tool choice reasoning, tool |
| | | effectiveness tracking |
+---------------------+----------------------+------------------------------+
| Autonomy Changes | Not captured | Level transitions, escalation|
| | | triggers, human-in-the-loop |
+---------------------+----------------------+------------------------------+
| Memory Quality | Cache hit/miss | Retrieval relevance, memory |
| | | freshness, embedding quality |
+---------------------+----------------------+------------------------------+
| Policy Compliance | Not captured | Rule checking, constraint |
| | | satisfaction, violation logs |
+---------------------+----------------------+------------------------------+
1.3 Regulatory Imperatives: The EU AI Act
The EU Artificial Intelligence Act, which entered into force in 2024 with obligations phasing in over the following years, imposes specific transparency and record-keeping obligations on high-risk AI systems. Article 12 requires that high-risk AI systems be designed with logging capabilities that enable the recording of events relevant to identifying situations that may result in the AI system presenting a risk. Article 13 mandates that high-risk AI systems be designed to ensure that their operation is sufficiently transparent to enable users to interpret the system's output and use it appropriately.
For autonomous agent systems, these requirements translate into concrete technical obligations:
- Decision Logging: Every significant decision made by an agent must be recorded with sufficient context to reconstruct the reasoning process.
- Traceability: It must be possible to trace any agent output back through the chain of decisions, tool invocations, and memory retrievals that produced it.
- Explainability: The system must be capable of generating human-readable explanations for its decisions, including what alternatives were considered and why they were rejected.
- Auditability: Decision records must be stored in tamper-resistant formats with retention periods that satisfy sector-specific requirements.
These obligations motivate the introduction of a fourth observability pillar -- Decision Audit Trails -- which we develop in detail in Section 4.
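These obligations can be made concrete as a structured, hash-chained decision record. The sketch below is one possible shape in TypeScript; the field names and the `chainHash` helper are our own illustration, not a schema mandated by the Act.

```typescript
import { createHash } from 'node:crypto';

// Illustrative decision-record shape for Article 12-style logging.
// Field names are our own; the Act mandates capabilities, not schemas.
interface DecisionRecord {
  decisionId: string;          // stable ID, referenced by outputs (traceability)
  timestamp: string;           // ISO 8601
  agentId: string;
  inputContext: string;        // what the agent saw at decision time
  selectedAction: string;
  alternatives: Array<{ action: string; rejectionReason: string }>; // explainability
  policiesChecked: string[];   // auditability
}

// Tamper evidence via hash chaining: each stored record carries a hash that
// commits to both its own content and its predecessor's hash, so any
// after-the-fact edit invalidates every subsequent link.
function chainHash(prevHash: string, record: DecisionRecord): string {
  return createHash('sha256')
    .update(prevHash)
    .update(JSON.stringify(record))
    .digest('hex');
}
```

Anchoring the chain's head (for example, publishing the latest hash to a separate system daily) upgrades tamper evidence to practical tamper resistance.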
1.4 The Fourth Pillar: Decision Audit Trails
The traditional three pillars of observability -- logs, metrics, and traces -- were popularized by Cindy Sridharan in her influential book "Distributed Systems Observability" (2018). We propose extending this model with a fourth pillar specifically designed for agent systems:
Diagram 1: Four Pillars of Agent Observability
+------------------------------------------------------------------+
| AGENT OBSERVABILITY |
+------------------------------------------------------------------+
| |
| +-----------+ +-----------+ +-----------+ +-----------------+ |
| | | | | | | | | |
| | LOGS | | METRICS | | TRACES | | DECISION AUDIT | |
| | | | | | | | TRAILS | |
| | Structured| | Counters | | Spans | | | |
| | events | | Gauges | | Context | | Reasoning | |
| | Correlation| | Histograms| | Propagation| | Alternatives | |
| | PII | | Recording | | Sampling | | Policy check | |
| | redaction | | rules | | strategies| | Confidence | |
| | | | Alerts | | Storage | | Outcome | |
| | | | | | | | Append-only | |
| +-----------+ +-----------+ +-----------+ +-----------------+ |
| | | | | |
| v v v v |
| +----------------------------------------------------------+ |
| | OPERATIONAL INTELLIGENCE | |
| | Dashboards | Anomaly Detection | Cost Optimization | |
| +----------------------------------------------------------+ |
+------------------------------------------------------------------+
Decision Audit Trails differ from traditional logs in several critical respects: they are append-only and immutable once written; they capture structured decision metadata rather than free-form text; they include the full context that was available to the agent at decision time; they record alternatives that were considered and rejected; and they link to the policy rules that were checked during the decision process. This pillar provides the foundation for both regulatory compliance and operational learning.
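The append-only property can be sketched in a few lines of TypeScript. This in-memory version is purely illustrative; a production trail would write to WORM object storage or an append-only database table, and all names here are our own.

```typescript
// Minimal append-only audit trail sketch. Records are frozen once written and
// the class deliberately exposes no update or delete operations.
interface AuditEntry {
  seq: number;
  decision: Record<string, unknown>;   // structured metadata, not free-form text
  alternativesRejected: string[];      // what was considered and discarded
  policiesChecked: string[];           // rules consulted at decision time
  confidence: number;
}

class AuditTrail {
  private entries: ReadonlyArray<Readonly<AuditEntry>> = [];

  append(entry: Omit<AuditEntry, 'seq'>): number {
    const seq = this.entries.length;
    // Freeze so the record is immutable once written.
    this.entries = [...this.entries, Object.freeze({ ...entry, seq })];
    return seq;
  }

  read(seq: number): Readonly<AuditEntry> | undefined {
    return this.entries[seq];
  }
}
```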
2. OpenTelemetry for Agents
2.1 Semantic Conventions for Agent Spans
OpenTelemetry provides a vendor-neutral framework for collecting telemetry data, but its default semantic conventions were designed for traditional web services and databases. Agent systems require custom semantic conventions that capture the cognitive and behavioral dimensions of agent operation. We define the following span types for agent instrumentation:
Table 2: Agent-Specific Span Types
+-------------------------+--------------------------------------------+-------------------+
| Span Name | Purpose | Key Attributes |
+-------------------------+--------------------------------------------+-------------------+
| agent.plan | Task decomposition and planning | plan.steps, |
| | | plan.strategy, |
| | | plan.confidence |
+-------------------------+--------------------------------------------+-------------------+
| agent.decide | Decision point with alternatives | decision.type, |
| | | decision.options, |
| | | decision.selected,|
| | | decision.rationale|
+-------------------------+--------------------------------------------+-------------------+
| agent.tool_call | Tool invocation with parameters | tool.name, |
| | | tool.params, |
| | | tool.result, |
| | | tool.tokens_used |
+-------------------------+--------------------------------------------+-------------------+
| agent.memory_retrieve | Memory/knowledge retrieval | memory.query, |
| | | memory.results, |
| | | memory.relevance, |
| | | memory.source |
+-------------------------+--------------------------------------------+-------------------+
| agent.communicate | Inter-agent message exchange | comm.sender, |
| | | comm.receiver, |
| | | comm.protocol, |
| | | comm.payload_size |
+-------------------------+--------------------------------------------+-------------------+
| agent.llm_call | LLM API invocation | llm.provider, |
| | | llm.model, |
| | | llm.tokens_in, |
| | | llm.tokens_out, |
| | | llm.cost |
+-------------------------+--------------------------------------------+-------------------+
| agent.policy_check | Policy/constraint checking | policy.name, |
| | | policy.result, |
| | | policy.violations |
+-------------------------+--------------------------------------------+-------------------+
| agent.autonomy_change | Autonomy level transition | autonomy.from, |
| | | autonomy.to, |
| | | autonomy.trigger |
+-------------------------+--------------------------------------------+-------------------+
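To keep instrumented code consistent with these conventions, the attribute names can be centralized as constants. The object below mirrors a subset of Table 2; it follows OpenTelemetry's dot-namespaced style but is our own convention, not part of the official semantic conventions registry.

```typescript
// Attribute-name constants for the agent span types in Table 2, so
// instrumentation code avoids typo-prone string literals.
export const AgentAttributes = {
  DECISION_TYPE: 'decision.type',
  DECISION_SELECTED: 'decision.selected',
  DECISION_RATIONALE: 'decision.rationale',
  TOOL_NAME: 'tool.name',
  TOOL_TOKENS_USED: 'tool.tokens_used',
  MEMORY_RELEVANCE: 'memory.relevance',
  LLM_MODEL: 'llm.model',
  POLICY_RESULT: 'policy.result',
  AUTONOMY_FROM: 'autonomy.from',
  AUTONOMY_TO: 'autonomy.to',
} as const;
```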
2.2 TypeScript Instrumentation
The following TypeScript implementation demonstrates how to instrument an agent's decision pipeline with OpenTelemetry custom spans. This code integrates with the @bluefly/agent-tracer package for distributed tracing across the Bluefly Agent Platform:
import { trace, SpanKind, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('agent-platform', '1.0.0');

interface DecisionContext {
  taskId: string;
  agentId: string;
  input: string;
  availableTools: string[];
  memoryContext: Record<string, unknown>;
  policyConstraints: string[];
}

interface DecisionResult {
  selectedAction: string;
  confidence: number;
  reasoning: string;
  alternativesConsidered: Array<{
    action: string;
    score: number;
    rejectionReason: string;
  }>;
  policyAssessment: {
    policiesChecked: string[];
    violations: string[];
    passed: boolean;
  };
}

export class AgentInstrumentor {
  /**
   * Instruments the complete agent decision pipeline.
   * Creates a parent span for the task and child spans for each
   * cognitive operation: planning, memory retrieval, decision-making,
   * tool execution, and inter-agent communication.
   */
  async executeInstrumentedTask(
    decisionContext: DecisionContext
  ): Promise<DecisionResult> {
    return tracer.startActiveSpan(
      'agent.task',
      {
        kind: SpanKind.INTERNAL,
        attributes: {
          'agent.id': decisionContext.agentId,
          'agent.task.id': decisionContext.taskId,
          'agent.task.input_length': decisionContext.input.length,
          'agent.tools.available_count': decisionContext.availableTools.length,
        },
      },
      async (taskSpan) => {
        try {
          // Phase 1: Planning
          const plan = await this.instrumentPlanning(decisionContext);

          // Phase 2: Memory Retrieval
          const memories = await this.instrumentMemoryRetrieval(
            decisionContext.input,
            decisionContext.agentId
          );

          // Phase 3: Decision Making
          const decision = await this.instrumentDecision(
            decisionContext,
            plan,
            memories
          );

          // Phase 4: Policy Checking
          const policyResult = await this.instrumentPolicyCheck(
            decision,
            decisionContext.policyConstraints
          );

          // Phase 5: Tool Execution (only if policy checks passed)
          if (policyResult.passed) {
            await this.instrumentToolExecution(
              decision.selectedAction,
              decisionContext
            );
          }

          taskSpan.setStatus({ code: SpanStatusCode.OK });
          return decision;
        } catch (error) {
          taskSpan.setStatus({
            code: SpanStatusCode.ERROR,
            message: error instanceof Error ? error.message : 'Unknown error',
          });
          taskSpan.recordException(error as Error);
          throw error;
        } finally {
          taskSpan.end();
        }
      }
    );
  }

  private async instrumentPlanning(ctx: DecisionContext): Promise<string[]> {
    return tracer.startActiveSpan(
      'agent.plan',
      {
        kind: SpanKind.INTERNAL,
        attributes: {
          'agent.plan.strategy': 'decomposition',
          'agent.plan.input_tokens': ctx.input.length,
        },
      },
      async (planSpan) => {
        const startTime = Date.now();
        // Planning logic invocation
        const steps = await this.planTask(ctx.input);
        planSpan.setAttribute('agent.plan.steps_count', steps.length);
        // Example static value; a real planner would report its own confidence
        planSpan.setAttribute('agent.plan.confidence', 0.87);
        planSpan.setAttribute('agent.plan.duration_ms', Date.now() - startTime);
        planSpan.setStatus({ code: SpanStatusCode.OK });
        planSpan.end();
        return steps;
      }
    );
  }

  private async instrumentMemoryRetrieval(
    query: string,
    agentId: string
  ): Promise<unknown[]> {
    return tracer.startActiveSpan(
      'agent.memory_retrieve',
      {
        kind: SpanKind.CLIENT,
        attributes: {
          'memory.query_length': query.length,
          'memory.source': 'qdrant',
          'memory.agent_id': agentId,
          'memory.collection': 'agent_memories',
        },
      },
      async (memorySpan) => {
        const startTime = Date.now();
        const results = await this.retrieveMemories(query);
        memorySpan.setAttribute('memory.results_count', results.length);
        memorySpan.setAttribute(
          'memory.avg_relevance',
          this.calculateAvgRelevance(results)
        );
        memorySpan.setAttribute('memory.latency_ms', Date.now() - startTime);
        memorySpan.setAttribute('memory.cache_hit', false);
        memorySpan.setStatus({ code: SpanStatusCode.OK });
        memorySpan.end();
        return results;
      }
    );
  }

  private async instrumentDecision(
    ctx: DecisionContext,
    plan: string[],
    memories: unknown[]
  ): Promise<DecisionResult> {
    return tracer.startActiveSpan(
      'agent.decide',
      {
        kind: SpanKind.INTERNAL,
        attributes: {
          'decision.type': 'tool_selection',
          'decision.context_size': JSON.stringify(memories).length,
          'decision.plan_steps': plan.length,
          'decision.options_count': ctx.availableTools.length,
        },
      },
      async (decisionSpan) => {
        const result = await this.makeDecision(ctx, plan, memories);
        decisionSpan.setAttribute('decision.selected', result.selectedAction);
        decisionSpan.setAttribute('decision.confidence', result.confidence);
        decisionSpan.setAttribute(
          'decision.alternatives_count',
          result.alternativesConsidered.length
        );
        decisionSpan.setAttribute(
          'decision.reasoning_length',
          result.reasoning.length
        );
        // Record each alternative as a span event
        for (const alt of result.alternativesConsidered) {
          decisionSpan.addEvent('decision.alternative_rejected', {
            'alternative.action': alt.action,
            'alternative.score': alt.score,
            'alternative.rejection_reason': alt.rejectionReason,
          });
        }
        decisionSpan.setStatus({ code: SpanStatusCode.OK });
        decisionSpan.end();
        return result;
      }
    );
  }

  private async instrumentPolicyCheck(
    decision: DecisionResult,
    constraints: string[]
  ): Promise<{ passed: boolean; violations: string[] }> {
    return tracer.startActiveSpan(
      'agent.policy_check',
      {
        kind: SpanKind.INTERNAL,
        attributes: {
          'policy.constraints_count': constraints.length,
          'policy.action_under_review': decision.selectedAction,
        },
      },
      async (policySpan) => {
        const checking = await this.checkPolicies(decision, constraints);
        policySpan.setAttribute('policy.passed', checking.passed);
        policySpan.setAttribute(
          'policy.violations_count',
          checking.violations.length
        );
        if (!checking.passed) {
          policySpan.setStatus({
            code: SpanStatusCode.ERROR,
            message: `Policy violations: ${checking.violations.join(', ')}`,
          });
          policySpan.addEvent('policy.violation_detected', {
            violations: JSON.stringify(checking.violations),
          });
        } else {
          policySpan.setStatus({ code: SpanStatusCode.OK });
        }
        policySpan.end();
        return checking;
      }
    );
  }

  private async instrumentToolExecution(
    toolName: string,
    ctx: DecisionContext
  ): Promise<void> {
    return tracer.startActiveSpan(
      'agent.tool_call',
      {
        kind: SpanKind.CLIENT,
        attributes: {
          'tool.name': toolName,
          'tool.agent_id': ctx.agentId,
          'tool.task_id': ctx.taskId,
        },
      },
      async (toolSpan) => {
        const startTime = Date.now();
        try {
          const result = await this.executeTool(toolName, ctx);
          toolSpan.setAttribute('tool.success', true);
          toolSpan.setAttribute('tool.duration_ms', Date.now() - startTime);
          toolSpan.setAttribute(
            'tool.result_size',
            JSON.stringify(result).length
          );
          toolSpan.setAttribute('tool.tokens_used', result.tokensUsed || 0);
          toolSpan.setStatus({ code: SpanStatusCode.OK });
        } catch (error) {
          toolSpan.setAttribute('tool.success', false);
          toolSpan.setAttribute('tool.error', (error as Error).message);
          toolSpan.setStatus({
            code: SpanStatusCode.ERROR,
            message: (error as Error).message,
          });
          toolSpan.recordException(error as Error);
          throw error;
        } finally {
          toolSpan.end();
        }
      }
    );
  }

  // Placeholder methods for actual implementations
  private async planTask(input: string): Promise<string[]> {
    return [];
  }
  private async retrieveMemories(query: string): Promise<unknown[]> {
    return [];
  }
  private calculateAvgRelevance(results: unknown[]): number {
    return 0;
  }
  private async makeDecision(
    ctx: DecisionContext,
    plan: string[],
    memories: unknown[]
  ): Promise<DecisionResult> {
    return {} as DecisionResult;
  }
  private async checkPolicies(
    decision: DecisionResult,
    constraints: string[]
  ): Promise<{ passed: boolean; violations: string[] }> {
    return { passed: true, violations: [] };
  }
  private async executeTool(
    name: string,
    ctx: DecisionContext
  ): Promise<{ tokensUsed: number }> {
    return { tokensUsed: 0 };
  }
}
2.3 Trace Context Propagation
In multi-agent systems, trace context must propagate across agent boundaries to maintain end-to-end visibility. When Agent A delegates a subtask to Agent B via an inter-agent communication protocol, the trace context (trace ID, span ID, and trace flags) must be transmitted alongside the task payload. OpenTelemetry's W3C Trace Context propagation format provides the standard mechanism:
import { propagation, context, ROOT_CONTEXT, Context } from '@opentelemetry/api';

export class AgentContextPropagator {
  /**
   * Injects trace context into an outgoing inter-agent message.
   * Uses the globally registered propagator (W3C Trace Context by default),
   * so the receiving agent can create child spans that correctly parent
   * under the sending agent's trace.
   */
  injectContext(message: Record<string, unknown>): Record<string, string> {
    const carrier: Record<string, string> = {};
    propagation.inject(context.active(), carrier);

    // Add agent-specific headers alongside the W3C traceparent/tracestate
    carrier['x-agent-id'] = message['agentId'] as string;
    carrier['x-task-id'] = message['taskId'] as string;
    carrier['x-autonomy-level'] = String(message['autonomyLevel'] || 'tier_1');
    return carrier;
  }

  /**
   * Extracts trace context from an incoming inter-agent message.
   * Creates a new context that parents subsequent spans under
   * the original trace, preserving end-to-end visibility.
   */
  extractContext(carrier: Record<string, string>): Context {
    return propagation.extract(ROOT_CONTEXT, carrier);
  }
}
Baggage Propagation: Beyond trace context, OpenTelemetry baggage allows agent-specific metadata to travel with requests across service boundaries. This is particularly valuable for carrying the agent's current autonomy level, the originating user's session ID, and policy scope identifiers through multi-hop agent chains.
2.4 Sampling Strategies
Agent systems generate significantly more telemetry data than traditional microservices due to the verbosity of cognitive traces. Effective sampling strategies are critical to managing storage costs without sacrificing observability quality:
Head-Based Sampling: Decisions are made at trace initiation. A fixed percentage of traces are sampled (e.g., 10%), and the decision propagates to all downstream spans. This approach is simple and predictable but risks missing important traces that only become interesting after they complete (e.g., traces that encounter errors late in processing).
Tail-Based Sampling: Decisions are deferred until trace completion, allowing the sampler to retain traces that exhibit interesting characteristics -- high latency, errors, policy violations, low confidence decisions, or autonomy level changes. This requires buffering complete traces before making sampling decisions, which increases memory requirements at the collector tier.
Priority-Based Sampling (Agent-Specific): We propose a priority-based sampling strategy that combines head-based efficiency with tail-based intelligence:
Formula 1: Priority-Based Sampling Score
S(trace) = w_e * E + w_v * V + w_c * (1 - C) + w_a * A + w_l * L
Where:
S(trace) = sampling priority score (0.0 to 1.0)
E = error indicator (1 if any span has error, 0 otherwise)
V = policy violation indicator (1 if any violation detected)
C = minimum confidence score across all decision spans
A = autonomy change indicator (1 if autonomy level changed)
L = normalized latency (actual / p99 threshold)
Weights: w_e = 0.3, w_v = 0.3, w_c = 0.2, w_a = 0.1, w_l = 0.1
Decision: Sample if S(trace) >= 0.3 OR random() < base_rate(0.05)
This formula ensures that traces containing errors, policy violations, low-confidence decisions, or autonomy changes are always sampled, while routine successful traces are sampled at a configurable base rate.
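Formula 1 translates directly into code. The `TraceSummary` shape below is an assumption about what a tail-sampling component would expose per completed trace:

```typescript
// Per-trace summary assumed to be available at tail-sampling time.
interface TraceSummary {
  hasError: boolean;            // E
  hasPolicyViolation: boolean;  // V
  minDecisionConfidence: number; // C, in [0, 1]
  autonomyChanged: boolean;     // A
  latencyMs: number;
  p99LatencyMs: number;         // normalization threshold for L
}

// S(trace) = w_e*E + w_v*V + w_c*(1 - C) + w_a*A + w_l*L
function samplingScore(t: TraceSummary): number {
  const E = t.hasError ? 1 : 0;
  const V = t.hasPolicyViolation ? 1 : 0;
  const A = t.autonomyChanged ? 1 : 0;
  const L = Math.min(t.latencyMs / t.p99LatencyMs, 1); // clamp normalized latency
  return 0.3 * E + 0.3 * V + 0.2 * (1 - t.minDecisionConfidence) + 0.1 * A + 0.1 * L;
}

// The >= threshold guarantees an error or violation alone is enough to sample.
function shouldSample(t: TraceSummary, baseRate = 0.05): boolean {
  return samplingScore(t) >= 0.3 || Math.random() < baseRate;
}
```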
2.5 Storage Estimation
Trace storage requirements for agent systems are substantially higher than for traditional microservices due to the additional cognitive span types and their associated attribute payloads:
Formula 2: Trace Storage Estimation
daily_storage = traces_per_sec * avg_spans_per_trace * avg_span_size_bytes
* 86400 * (1 - compression_ratio)
Where for agent systems:
traces_per_sec = 50 (moderate fleet)
avg_spans_per_trace = 12 (plan + memory + decide + policy + tool + llm + comm)
avg_span_size_bytes = 2048 (larger due to reasoning attributes)
compression_ratio = 0.6 (typical for structured data)
daily_storage = 50 * 12 * 2048 * 86400 * 0.4
= 42.5 GB/day (unsampled)
With 10% sampling:
daily_storage = 4.25 GB/day
monthly_storage = 127.5 GB/month
annual_storage = 1.55 TB/year (at 10% sampling)
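Formula 2 can be packaged as a small helper whose defaults reproduce the worked example above:

```typescript
// Daily trace storage estimate (Formula 2). Defaults match the worked
// example: 50 traces/s, 12 spans/trace, 2048 B/span, 0.6 compression.
function dailyTraceStorageGB(opts: {
  tracesPerSec?: number;
  avgSpansPerTrace?: number;
  avgSpanSizeBytes?: number;
  compressionRatio?: number;
  samplingRate?: number;
} = {}): number {
  const {
    tracesPerSec = 50,
    avgSpansPerTrace = 12,
    avgSpanSizeBytes = 2048,
    compressionRatio = 0.6,
    samplingRate = 1.0,
  } = opts;
  const bytes =
    tracesPerSec * avgSpansPerTrace * avgSpanSizeBytes * 86_400 *
    (1 - compressionRatio) * samplingRate;
  return bytes / 1e9; // decimal GB, matching the estimates above
}
```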
Diagram 2: Distributed Agent Trace Data Flow
+------------------+ +------------------+ +------------------+
| Agent Process | | Agent Process | | Agent Process |
| (Orchestrator) | | (Researcher) | | (Executor) |
| | | | | |
| [agent.task] | | [agent.task] | | [agent.task] |
| |--[plan] | | |--[plan] | | |--[tool_call] |
| |--[decide] ---|---->| |--[memory] | | |--[policy] |
| |--[comm] | | |--[llm_call] |---->| |--[execute] |
| | | |--[decide] | | |
+--------|---------+ +--------|---------+ +--------|---------+
| | |
v v v
+------------------------------------------------------------------+
| OTEL Collector Pipeline |
| |
| [Receiver]-->[Processor]-->[Processor]-->[Exporter] |
| OTLP Batch Tail-Based Tempo/Jaeger |
| Sampling |
+------------------------------------------------------------------+
| | |
v v v
+------------------------------------------------------------------+
| Trace Storage (Tempo) |
| |
| Object Storage (S3/MinIO) + Index (search by traceID, |
| agent.id, decision.confidence, policy.violations) |
+------------------------------------------------------------------+
2.6 OpenTelemetry Collector Configuration
The OTEL Collector serves as the central aggregation point for all agent telemetry. The following configuration demonstrates a production-ready pipeline with tail-based sampling optimized for agent workloads:
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
        max_recv_msg_size_mib: 16
      http:
        endpoint: "0.0.0.0:4318"

processors:
  batch:
    timeout: 5s
    send_batch_size: 8192
    send_batch_max_size: 16384
  memory_limiter:
    check_interval: 5s
    limit_mib: 4096
    spike_limit_mib: 1024
  tail_sampling:
    decision_wait: 30s
    num_traces: 100000
    expected_new_traces_per_sec: 50
    policies:
      # Always sample traces with errors
      - name: errors-policy
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Always sample policy violations
      - name: policy-violations
        type: string_attribute
        string_attribute:
          key: policy.violations_count
          values: ["1", "2", "3", "4", "5"]
      # Sample low-confidence decisions
      - name: low-confidence
        type: numeric_attribute
        numeric_attribute:
          key: decision.confidence
          min_value: 0
          max_value: 0.5
      # Always sample autonomy changes
      - name: autonomy-changes
        type: string_attribute
        string_attribute:
          key: autonomy.trigger
          values: ["escalation", "emergency", "policy_override"]
      # Base rate sampling for normal traces
      - name: base-rate
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
  attributes:
    actions:
      # Redact PII from span attributes
      - key: user.email
        action: hash
      - key: user.ip
        action: delete
      - key: agent.task.input
        action: hash

exporters:
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: false
      ca_file: /etc/ssl/certs/ca.pem
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: agent_platform
    resource_to_telemetry_conversion:
      enabled: true
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"
    labels:
      attributes:
        agent.id: "agent_id"
        agent.task.id: "task_id"

service:
  telemetry:
    logs:
      level: info
    metrics:
      address: "0.0.0.0:8888"
  pipelines:
    traces:
      # memory_limiter first; batch last so redaction happens before batching
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, attributes, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, attributes, batch]
      exporters: [loki]
3. Prometheus Metrics for Agent Systems
3.1 Custom Agent Metrics
Traditional Prometheus metrics for web services focus on HTTP request rates, response times, and error codes. Agent systems require metrics that capture cognitive performance, economic efficiency, and behavioral patterns. The following metric definitions form the foundation of agent-native monitoring:
import { Counter, Histogram, Gauge, Registry } from 'prom-client';

const registry = new Registry();

// --- Task Performance Metrics ---
const taskDuration = new Histogram({
  name: 'agent_task_duration_seconds',
  help: 'Duration of agent task execution including all cognitive phases',
  labelNames: ['agent_id', 'task_type', 'status', 'autonomy_level'],
  buckets: [0.1, 0.5, 1, 2, 5, 10, 30, 60, 120, 300],
  registers: [registry],
});

const taskTotal = new Counter({
  name: 'agent_tasks_total',
  help: 'Total number of tasks processed by agents',
  labelNames: ['agent_id', 'task_type', 'status'],
  registers: [registry],
});

// --- Token Economics Metrics ---
const tokenUsage = new Counter({
  name: 'agent_token_usage_total',
  help: 'Total tokens consumed by agent LLM calls',
  labelNames: ['agent_id', 'model', 'direction', 'provider'],
  registers: [registry],
});

const tokenCost = new Counter({
  name: 'agent_token_cost_dollars',
  help: 'Estimated cost in USD of tokens consumed',
  labelNames: ['agent_id', 'model', 'provider'],
  registers: [registry],
});

// --- Decision Metrics ---
const decisionCount = new Counter({
  name: 'agent_decisions_total',
  help: 'Total decisions made by agents',
  labelNames: ['agent_id', 'decision_type', 'outcome'],
  registers: [registry],
});

const decisionConfidence = new Histogram({
  name: 'agent_decision_confidence',
  help: 'Distribution of decision confidence scores',
  labelNames: ['agent_id', 'decision_type'],
  buckets: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 1.0],
  registers: [registry],
});

// --- Memory Performance Metrics ---
const memoryLatency = new Histogram({
  name: 'agent_memory_retrieval_latency_seconds',
  help: 'Latency of memory/knowledge retrieval operations',
  labelNames: ['agent_id', 'memory_source', 'cache_hit'],
  buckets: [0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5],
  registers: [registry],
});

const memoryRelevance = new Histogram({
  name: 'agent_memory_retrieval_relevance',
  help: 'Relevance scores of retrieved memories',
  labelNames: ['agent_id', 'memory_source'],
  buckets: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
  registers: [registry],
});

// --- Autonomy Metrics ---
const autonomyLevel = new Gauge({
  name: 'agent_autonomy_level',
  help: 'Current autonomy level of the agent (1-4)',
  labelNames: ['agent_id'],
  registers: [registry],
});

const autonomyChanges = new Counter({
  name: 'agent_autonomy_changes_total',
  help: 'Total autonomy level transitions',
  labelNames: ['agent_id', 'from_level', 'to_level', 'trigger'],
  registers: [registry],
});

// --- Policy Compliance Metrics ---
const policyViolations = new Counter({
  name: 'agent_policy_violations_total',
  help: 'Total policy violations detected',
  labelNames: ['agent_id', 'policy_name', 'severity'],
  registers: [registry],
});

const policyCheckDuration = new Histogram({
  name: 'agent_policy_check_duration_seconds',
  help: 'Duration of policy compliance checks',
  labelNames: ['agent_id', 'policy_name'],
  buckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25],
  registers: [registry],
});

// --- Error Metrics ---
const errorRate = new Counter({
  name: 'agent_errors_total',
  help: 'Total errors encountered by agents',
  labelNames: ['agent_id', 'error_type', 'severity', 'recoverable'],
  registers: [registry],
});

// --- Tool Usage Metrics ---
const toolCallDuration = new Histogram({
  name: 'agent_tool_call_duration_seconds',
  help: 'Duration of individual tool invocations',
  labelNames: ['agent_id', 'tool_name', 'status'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30],
  registers: [registry],
});

const toolCallTotal = new Counter({
  name: 'agent_tool_calls_total',
  help: 'Total tool invocations by agents',
  labelNames: ['agent_id', 'tool_name', 'status'],
  registers: [registry],
});
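A caution on cardinality: labels such as agent_id multiply time-series counts quickly, especially on histograms, where every label combination emits one series per bucket plus _sum and _count. A rough estimator (the fleet sizes in the comment are hypothetical):

```typescript
// Rough active-series estimate for one metric: product of per-label
// cardinalities; histograms add bucketCount + 1 (_bucket incl. +Inf)
// plus _sum and _count series per label combination.
function estimatedSeries(labelCardinalities: number[], bucketCount = 0): number {
  const combos = labelCardinalities.reduce((a, b) => a * b, 1);
  return bucketCount > 0 ? combos * (bucketCount + 3) : combos;
}

// Hypothetical fleet: agent_task_duration_seconds with 200 agents,
// 10 task types, 3 statuses, 4 autonomy levels, and 10 buckets yields
// 24,000 combos x 13 series = 312,000 series for one metric -- a strong
// argument for labeling by agent *class* rather than agent_id at scale.
```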
3.2 Recording Rules
Recording rules pre-compute frequently queried aggregations, reducing query-time latency and improving dashboard responsiveness. The following Prometheus recording rules derive key operational indicators from the raw metrics:
# prometheus-recording-rules.yaml
groups:
  - name: agent_performance_rules
    interval: 30s
    rules:
      # Task success rate per agent (5-minute window)
      - record: agent:task_success_rate:5m
        expr: |
          sum(rate(agent_tasks_total{status="success"}[5m])) by (agent_id)
          /
          sum(rate(agent_tasks_total[5m])) by (agent_id)

      # Cost per successful task
      - record: agent:cost_per_task:5m
        expr: |
          sum(rate(agent_token_cost_dollars[5m])) by (agent_id)
          /
          sum(rate(agent_tasks_total{status="success"}[5m])) by (agent_id)

      # Average decision confidence per agent
      - record: agent:avg_decision_confidence:5m
        expr: |
          sum(rate(agent_decision_confidence_sum[5m])) by (agent_id)
          /
          sum(rate(agent_decision_confidence_count[5m])) by (agent_id)

      # Memory retrieval hit rate
      - record: agent:memory_hit_rate:5m
        expr: |
          sum(rate(agent_memory_retrieval_latency_seconds_count{cache_hit="true"}[5m])) by (agent_id)
          /
          sum(rate(agent_memory_retrieval_latency_seconds_count[5m])) by (agent_id)

      # Token efficiency (tokens per successful task)
      - record: agent:tokens_per_task:5m
        expr: |
          sum(rate(agent_token_usage_total[5m])) by (agent_id)
          /
          sum(rate(agent_tasks_total{status="success"}[5m])) by (agent_id)

      # Policy violation rate
      - record: agent:violation_rate:5m
        expr: |
          sum(rate(agent_policy_violations_total[5m])) by (agent_id)
          /
          sum(rate(agent_decisions_total[5m])) by (agent_id)

      # Tool effectiveness score
      - record: agent:tool_success_rate:5m
        expr: |
          sum(rate(agent_tool_calls_total{status="success"}[5m])) by (agent_id, tool_name)
          /
          sum(rate(agent_tool_calls_total[5m])) by (agent_id, tool_name)

      # Fleet-wide error rate
      - record: fleet:error_rate:5m
        expr: |
          sum(rate(agent_errors_total[5m]))
          /
          sum(rate(agent_tasks_total[5m]))

      # Fleet-wide token cost rate (dollars per hour)
      - record: fleet:cost_rate_dollars_per_hour:5m
        expr: |
          sum(rate(agent_token_cost_dollars[5m])) * 3600
3.3 Alert Rules
Alert rules detect operational anomalies that require human intervention. Agent-specific alerts must go beyond traditional error rate thresholds to include cognitive performance degradation, policy compliance failures, and economic anomalies:
# prometheus-alert-rules.yaml
groups:
  - name: agent_critical_alerts
    rules:
      # High error rate
      - alert: AgentHighErrorRate
        expr: agent:task_success_rate:5m < 0.95
        for: 5m
        labels:
          severity: critical
          category: reliability
        annotations:
          summary: "Agent {{ $labels.agent_id }} error rate exceeds 5%"
          description: >
            Agent {{ $labels.agent_id }} success rate has dropped to
            {{ $value | humanizePercentage }} over the past 5 minutes.
            Investigate recent decision traces for root cause.
          runbook_url: "https://wiki.internal/runbooks/agent-error-rate"

      # Autonomy level anomaly
      - alert: AgentAutonomyEscalation
        expr: |
          increase(agent_autonomy_changes_total{to_level=~"tier_3|tier_4"}[15m]) > 0
        for: 0m
        labels:
          severity: warning
          category: security
        annotations:
          summary: "Agent {{ $labels.agent_id }} escalated to {{ $labels.to_level }}"
          description: >
            Agent {{ $labels.agent_id }} autonomy level changed from
            {{ $labels.from_level }} to {{ $labels.to_level }}.
            Trigger: {{ $labels.trigger }}. Review immediately.

      # Policy violations detected
      - alert: AgentPolicyViolation
        expr: increase(agent_policy_violations_total[5m]) > 0
        for: 0m
        labels:
          severity: critical
          category: compliance
        annotations:
          summary: "Policy violation by agent {{ $labels.agent_id }}"
          description: >
            Agent {{ $labels.agent_id }} violated policy {{ $labels.policy_name }}
            (severity: {{ $labels.severity }}). Immediate review required
            for compliance.

      # Decision confidence drop
      - alert: AgentLowConfidence
        expr: agent:avg_decision_confidence:5m < 0.5
        for: 10m
        labels:
          severity: warning
          category: performance
        annotations:
          summary: "Agent {{ $labels.agent_id }} confidence below threshold"
          description: >
            Average decision confidence for agent {{ $labels.agent_id }} has
            dropped to {{ $value }} over the past 10 minutes. Check memory
            retrieval quality and model performance.

      # Token cost spike
      - alert: AgentCostSpike
        expr: |
          fleet:cost_rate_dollars_per_hour:5m > 50
        for: 15m
        labels:
          severity: warning
          category: economics
        annotations:
          summary: "Fleet token cost exceeds $50/hour"
          description: >
            Current fleet-wide token cost is ${{ $value }}/hour. Check for
            runaway agents or inefficient token usage patterns.

      # Memory retrieval degradation
      - alert: AgentMemoryDegradation
        expr: |
          histogram_quantile(0.95,
            sum by (le, agent_id) (rate(agent_memory_retrieval_latency_seconds_bucket[5m]))
          ) > 2.0
        for: 10m
        labels:
          severity: warning
          category: performance
        annotations:
          summary: "Memory retrieval P95 latency exceeds 2s for {{ $labels.agent_id }}"
          description: >
            P95 memory retrieval latency for agent {{ $labels.agent_id }} is
            {{ $value }}s. Check Qdrant cluster health and index status.

      # Fleet-wide task throughput drop
      - alert: FleetThroughputDrop
        expr: |
          sum(rate(agent_tasks_total[5m]))
          <
          0.5 * sum(rate(agent_tasks_total[1h] offset 1h))
        for: 15m
        labels:
          severity: warning
          category: capacity
        annotations:
          summary: "Fleet task throughput dropped by more than 50%"
          description: >
            Current task throughput is significantly below the historical
            baseline. Investigate infrastructure issues or upstream request
            volume changes.
3.4 Cardinality Management
Prometheus performance degrades as the number of unique time series increases. For agent systems, cardinality is a particular concern because each agent instance, tool, decision type, and model combination creates distinct label sets:
Formula 3: Cardinality Estimation
total_series = SUM over metrics of: PRODUCT over labels of: cardinality(label_i)
Example for agent_task_duration_seconds:
Labels: agent_id(50) * task_type(10) * status(3) * autonomy_level(4)
Series: 50 * 10 * 3 * 4 = 6,000
Example for agent_tool_calls_total:
Labels: agent_id(50) * tool_name(20) * status(2)
Series: 50 * 20 * 2 = 2,000
Total across all 15 custom metrics:
Estimated: ~45,000 active series (within 100K target)
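Formula 3 can be sketched as a small helper. The metric names below come from Section 3.1, but the per-label cardinalities are illustrative estimates, not measured values. Note a caveat the simple product ignores: for histogram metrics, Prometheus emits one series per bucket plus _sum and _count for every label combination, so histogram label sets must be kept especially small.

```typescript
// Sketch of Formula 3: estimate active series from per-label cardinalities.
// Cardinalities here are illustrative fleet-sizing assumptions.

interface MetricSpec {
  name: string;
  labelCardinalities: number[]; // estimated distinct values per label
}

// PRODUCT over labels of cardinality(label_i).
// Caveat: for histograms, multiply the result by (buckets + 2) to account
// for per-bucket series plus _sum and _count.
function seriesForMetric(spec: MetricSpec): number {
  return spec.labelCardinalities.reduce((acc, c) => acc * c, 1);
}

// SUM over metrics.
function totalSeries(specs: MetricSpec[]): number {
  return specs.reduce((acc, s) => acc + seriesForMetric(s), 0);
}

const specs: MetricSpec[] = [
  { name: 'agent_task_duration_seconds', labelCardinalities: [50, 10, 3, 4] },
  { name: 'agent_tool_calls_total', labelCardinalities: [50, 20, 2] },
];

console.log(totalSeries(specs)); // 6000 + 2000 = 8000
```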
Table 3: Cardinality Budget Allocation
+-------------------------------+---------+------------+
| Metric Category | Series | % of Budget|
+-------------------------------+---------+------------+
| Task Performance (4 metrics) | 12,000 | 26.7% |
| Token Economics (2 metrics) | 5,000 | 11.1% |
| Decision Metrics (2 metrics) | 4,000 | 8.9% |
| Memory Metrics (2 metrics) | 3,000 | 6.7% |
| Autonomy Metrics (2 metrics) | 1,500 | 3.3% |
| Policy Metrics (2 metrics) | 2,500 | 5.6% |
| Tool Usage (2 metrics) | 4,000 | 8.9% |
| Error Metrics (1 metric) | 3,000 | 6.7% |
| Infrastructure (built-in) | 10,000 | 22.2% |
+-------------------------------+---------+------------+
| TOTAL | 45,000 | 100% |
| TARGET MAXIMUM | 100,000 | |
+-------------------------------+---------+------------+
To keep cardinality within the 100,000 series budget:
- Limit label values: Use agent role labels (e.g., "orchestrator", "researcher") instead of unique agent instance IDs where fleet-level aggregation suffices.
- Drop high-cardinality labels: Never include request IDs, user IDs, or trace IDs as metric labels. These belong in traces and logs.
- Use relabeling: Configure Prometheus relabeling rules to drop or hash high-cardinality labels at scrape time.
- Monitor cardinality: Alert when prometheus_tsdb_head_series approaches the budget threshold.
4. Decision Audit Trails
4.1 Decision Record Schema
The Decision Audit Trail captures every significant decision made by an agent in a structured, append-only format. Each decision record includes the action taken, the reasoning behind it, the context that was available, the alternatives that were considered, the policy rules that were checked, the confidence level, and the eventual outcome:
/**
 * Schema for agent decision audit records.
 * All fields are immutable once written.
 * Records are stored in append-only storage with
 * cryptographic integrity verification.
 */
interface DecisionAuditRecord {
  // --- Identity ---
  id: string;             // UUID v7 (time-ordered)
  traceId: string;        // OpenTelemetry trace ID
  spanId: string;         // OpenTelemetry span ID
  agentId: string;        // Agent instance identifier
  agentRole: string;      // OSSA role (analyzer, executor, etc.)
  taskId: string;         // Parent task identifier
  timestamp: string;      // ISO 8601 with microsecond precision
  sequenceNumber: number; // Monotonic within agent instance

  // --- Action ---
  action: {
    type: string;   // 'tool_call' | 'delegate' | 'respond' | 'escalate'
    target: string; // Tool name, agent ID, or endpoint
    parameters: Record<string, unknown>; // Action parameters (PII-redacted)
  };

  // --- Reasoning ---
  reasoning: {
    chainOfThought: string;  // Summarized reasoning (not raw LLM output)
    keyFactors: string[];    // Top factors influencing the decision
    assumptions: string[];   // Assumptions made during reasoning
    uncertainties: string[]; // Identified uncertainties
  };

  // --- Context ---
  context: {
    inputSummary: string; // Summarized input (PII-redacted)
    memoriesUsed: Array<{
      id: string;
      source: string;
      relevanceScore: number;
      ageHours: number;
    }>;
    conversationTurn: number; // Position in conversation
    environmentState: Record<string, unknown>; // Relevant env variables
    autonomyLevel: string; // Current OSSA tier
  };

  // --- Alternatives ---
  alternatives: Array<{
    action: string;          // Alternative action considered
    estimatedScore: number;  // Score (0-1)
    rejectionReason: string; // Why this alternative was not selected
    riskAssessment: string;  // 'low' | 'medium' | 'high'
  }>;

  // --- Policy Assessment ---
  policyAssessment: {
    policiesChecked: Array<{
      policyId: string;
      policyName: string;
      result: 'pass' | 'fail' | 'warning' | 'not_applicable';
      details: string;
    }>;
    overallResult: 'compliant' | 'violation' | 'warning';
    humanApprovalRequired: boolean;
    humanApprovalReceived: boolean;
  };

  // --- Confidence ---
  confidence: {
    overall: number; // 0.0 to 1.0
    breakdown: {
      informationSufficiency: number; // Is enough context available?
      actionAppropriateness: number;  // Is the chosen action correct?
      outcomePredictability: number;  // Can we predict the outcome?
    };
  };

  // --- Outcome (populated after execution) ---
  outcome: {
    status: 'success' | 'failure' | 'partial' | 'pending';
    resultSummary: string;
    executionDurationMs: number;
    tokensConsumed: number;
    costUsd: number;
    sideEffects: string[];           // Any unintended consequences
    feedbackReceived: string | null; // Human feedback if any
  };

  // --- Integrity ---
  integrity: {
    previousRecordHash: string; // SHA-256 of previous record (chain)
    recordHash: string;         // SHA-256 of this record
    signedBy: string;           // Agent signing key identifier
  };
}
4.2 Append-Only Storage and Decision Replay
Decision audit records must be stored in an append-only fashion to ensure tamper resistance and regulatory compliance. The storage system provides the following guarantees:
- Immutability: Once a record is written, it cannot be modified or deleted (except through explicit, audited retention policy expiry).
- Ordering: Records are ordered by UUID v7 timestamps, providing natural chronological ordering.
- Integrity Chain: Each record includes the SHA-256 hash of the previous record, creating a hash chain that detects tampering.
- Decision Replay: The complete context captured in each record enables replaying the agent's decision process for post-hoc analysis.
import { createHash } from 'crypto';

export class DecisionAuditStore {
  // Chain head for this agent instance. NOTE: held in memory here for
  // brevity; a production implementation must recover the latest record
  // hash from storage on restart so the chain remains unbroken.
  private lastRecordHash: string = '0'.repeat(64);

  /**
   * Writes a decision record to append-only storage.
   * Computes integrity hashes and enforces the hash chain.
   */
  async writeRecord(
    record: Omit<DecisionAuditRecord, 'integrity'>
  ): Promise<DecisionAuditRecord> {
    const integrity = {
      previousRecordHash: this.lastRecordHash,
      recordHash: '', // Computed below
      signedBy: record.agentId,
    };

    // Compute record hash (excluding the hash field itself)
    const hashInput = JSON.stringify({
      ...record,
      integrity: { ...integrity, recordHash: '' },
    });
    integrity.recordHash = createHash('sha256')
      .update(hashInput)
      .digest('hex');

    const completeRecord: DecisionAuditRecord = { ...record, integrity };

    // Persist to append-only storage
    await this.persistRecord(completeRecord);

    // Update chain
    this.lastRecordHash = integrity.recordHash;
    return completeRecord;
  }

  /**
   * Replays all decisions for a given task to reconstruct
   * the complete decision history.
   */
  async replayDecisions(taskId: string): Promise<DecisionAuditRecord[]> {
    const records = await this.queryByTaskId(taskId);

    // Verify hash chain linkage. A task's records are a contiguous slice
    // of the per-agent chain, so the first record's previousRecordHash is
    // used as the anchor rather than the genesis hash (which would only
    // match the very first record the agent ever wrote).
    let expectedPreviousHash =
      records.length > 0
        ? records[0].integrity.previousRecordHash
        : '0'.repeat(64);
    for (const record of records) {
      if (record.integrity.previousRecordHash !== expectedPreviousHash) {
        throw new Error(
          `Hash chain integrity violation at record ${record.id}`
        );
      }
      expectedPreviousHash = record.integrity.recordHash;
    }
    return records;
  }

  private async persistRecord(record: DecisionAuditRecord): Promise<void> {
    // Implementation: PostgreSQL with append-only table,
    // or object storage (S3/MinIO) with write-once policy
  }

  private async queryByTaskId(
    taskId: string
  ): Promise<DecisionAuditRecord[]> {
    return [];
  }
}
4.3 Explainability Framework
Following the framework proposed by Doshi-Velez and Kim (2017), we implement three levels of explanation for agent decisions:
Feature Attribution: Identifies which input features most strongly influenced the decision. For agent systems, this translates to identifying which memories, context elements, or environmental factors drove the selection of a particular action.
Counterfactual Explanation: Answers the question "what would need to change for the agent to have made a different decision?" For example, "if the confidence score on the web search results had been above 0.8 instead of 0.6, the agent would have used those results directly instead of delegating to the research agent."
Contrastive Explanation: Answers the question "why did the agent choose action A instead of action B?" This directly leverages the alternatives recorded in the decision audit trail.
interface ExplainabilityReport {
  decisionId: string;
  timestamp: string;

  // Feature Attribution
  featureAttribution: Array<{
    feature: string;     // e.g., "memory.project_context"
    influence: number;   // -1.0 to 1.0 (negative = against decision)
    description: string; // Human-readable explanation
  }>;

  // Counterfactual
  counterfactual: {
    question: string; // "What would change the outcome?"
    minimumChanges: Array<{
      feature: string;
      currentValue: unknown;
      requiredValue: unknown;
      alternateOutcome: string;
    }>;
  };

  // Contrastive
  contrastive: Array<{
    selectedAction: string;
    alternativeAction: string;
    differentiatingFactors: Array<{
      factor: string;
      favoredSelected: boolean; // true if factor favored the selected action
      explanation: string;
    }>;
  }>;
}
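To show how the contrastive section can be populated from the alternatives recorded in the decision audit trail, here is a minimal sketch. The buildContrastive helper and its single score-based differentiating factor are illustrative assumptions, not part of the platform schema; a real implementation would compare multiple recorded factors.

```typescript
// Sketch: derive contrastive explanation entries from the alternatives
// captured in a decision audit record. Types are trimmed copies of the
// schema fields; buildContrastive is an illustrative helper.

interface Alternative {
  action: string;
  estimatedScore: number; // 0-1
  rejectionReason: string;
}

interface ContrastivePair {
  selectedAction: string;
  alternativeAction: string;
  differentiatingFactors: Array<{
    factor: string;
    favoredSelected: boolean;
    explanation: string;
  }>;
}

function buildContrastive(
  selectedAction: string,
  selectedScore: number,
  alternatives: Alternative[]
): ContrastivePair[] {
  return alternatives.map(alt => ({
    selectedAction,
    alternativeAction: alt.action,
    differentiatingFactors: [
      {
        factor: 'estimated_score',
        favoredSelected: selectedScore >= alt.estimatedScore,
        explanation:
          `Selected scored ${selectedScore.toFixed(2)} vs ` +
          `${alt.estimatedScore.toFixed(2)}; rejected because: ` +
          alt.rejectionReason,
      },
    ],
  }));
}
```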
4.4 Decision Audit Pipeline
Diagram 3: Decision Audit Pipeline
+------------------+
| Agent Runtime |
| |
| [Decision Point] |
| | |
| [Build Record] |
| | |
| [Compute Hash] |
| | |
| [Chain Link] |
+--------|----------+
|
v
+------------------+ +------------------+ +------------------+
| Message Queue | | Audit Processor | | Append-Only |
| (Kafka/NATS) |---->| |---->| Storage |
| | | - Validate schema| | |
| topic: | | - Verify chain | | - PostgreSQL |
| agent.decisions | | - PII redaction | | (WORM mode) |
| | | - Enrich context | | - MinIO S3 |
+------------------+ | - Index for search| | (Object Lock) |
+--------|----------+ +------------------+
|
+--------v----------+
| Search Index |
| |
| - Elasticsearch |
| - By agent_id |
| - By task_id |
| - By policy_result |
| - By confidence |
| - By timestamp |
+-------------------+
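The "Verify chain" step in the Audit Processor above can be sketched as follows. The record shape is trimmed to the integrity fields; the queue consumption (Kafka/NATS) and Elasticsearch indexing stages are assumed and omitted here.

```typescript
// Sketch of the Audit Processor's chain-verification step from Diagram 3.

interface ChainedRecord {
  id: string;
  integrity: { previousRecordHash: string; recordHash: string };
}

// Returns the IDs of records whose chain link does not match the
// preceding record's hash. The first record's previousRecordHash anchors
// the batch, since a batch may start mid-chain.
function verifyBatch(records: ChainedRecord[]): string[] {
  const violations: string[] = [];
  if (records.length === 0) return violations;
  let expected = records[0].integrity.previousRecordHash;
  for (const record of records) {
    if (record.integrity.previousRecordHash !== expected) {
      violations.push(record.id);
    }
    expected = record.integrity.recordHash;
  }
  return violations;
}
```

Records that fail verification would be routed to a quarantine topic for investigation rather than silently dropped, since a broken link may indicate tampering.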
4.5 Retention Policies
Decision audit trail retention must satisfy industry-specific regulatory requirements:
Table 4: Decision Audit Retention Requirements
+------------------+------------------+---------------------------+-------------------+
| Industry | Retention Period | Regulatory Basis | Storage Tier |
+------------------+------------------+---------------------------+-------------------+
| Financial | 7 years | SEC Rule 17a-4, | Cold storage |
| Services | | MiFID II, SOX | after 90 days |
+------------------+------------------+---------------------------+-------------------+
| Healthcare | 6 years | HIPAA 45 CFR 164.530, | Cold storage |
| | (+ state laws) | State medical records | after 30 days |
+------------------+------------------+---------------------------+-------------------+
| EU AI Act | Duration of | Article 12, Article 20 | Hot storage for |
| (High-Risk) | system lifetime | | active system |
| | + 10 years | | period |
+------------------+------------------+---------------------------+-------------------+
| General / | 2 years | GDPR data minimization, | Hot 30 days, |
| Non-Regulated | | business records | warm 180 days, |
| | | | cold remainder |
+------------------+------------------+---------------------------+-------------------+
| Government / | 10 years | FOIA, Federal Records | Cold storage |
| Public Sector | | Act | after 60 days |
+------------------+------------------+---------------------------+-------------------+
Storage cost optimization is achieved through tiered storage:
Formula 4: Decision Audit Storage Cost
annual_cost = SUM over tiers of: (records_in_tier * avg_record_size * cost_per_GB_tier)
For a fleet of 50 agents making ~100 decisions/hour each:
daily_records = 50 * 100 * 24 = 120,000
avg_record_size = 4 KB (JSON, compressed)
daily_storage = 120,000 * 4 KB = 468 MB/day
Hot tier (30 days): 14 GB * $0.023/GB/month = $0.32/month
Warm tier (150 days): 70.2 GB * $0.0125/GB/month = $0.88/month
Cold tier (remainder): variable * $0.004/GB/month
Annual storage cost (2-year retention): ~$45/year
Annual storage cost (7-year retention): ~$120/year
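Formula 4 and the worked example above can be sketched as a small calculator. The tier prices are the illustrative S3-style rates used in the example, and the 4 KB record size is the document's compressed-JSON estimate; actual prices and sizes vary by provider and payload.

```typescript
// Sketch of Formula 4 under the worked example's assumptions.

interface Tier {
  days: number;            // days of data resident in this tier
  pricePerGbMonth: number; // USD per GB-month (illustrative)
}

// Daily audit volume in GB (binary units, matching the ~468 MB/day figure).
function dailyStorageGb(
  agents: number,
  decisionsPerHour: number,
  recordKb: number
): number {
  return (agents * decisionsPerHour * 24 * recordKb) / (1024 * 1024);
}

// Steady-state monthly cost of holding `tier.days` worth of records.
function monthlyTierCost(dailyGb: number, tier: Tier): number {
  return dailyGb * tier.days * tier.pricePerGbMonth;
}

const daily = dailyStorageGb(50, 100, 4); // ~0.458 GB/day
const hot = monthlyTierCost(daily, { days: 30, pricePerGbMonth: 0.023 });    // ~$0.32/month
const warm = monthlyTierCost(daily, { days: 150, pricePerGbMonth: 0.0125 }); // ~$0.86/month
```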
5. Grafana Dashboards
5.1 Dashboard Architecture
A comprehensive agent observability platform requires multiple specialized dashboards, each targeting a different stakeholder persona and analytical dimension. We define five core dashboards:
Table 5: Dashboard Architecture
+-------------------------+----------------------+---------------------------+
| Dashboard | Primary Audience | Key Questions Answered |
+-------------------------+----------------------+---------------------------+
| Fleet Overview | Operations / | How is the fleet |
| | Platform Team | performing overall? |
| | | Any systemic issues? |
+-------------------------+----------------------+---------------------------+
| Individual Agent | Agent Developers / | How is this specific |
| Performance | Debugging | agent behaving? Any |
| | | regressions? |
+-------------------------+----------------------+---------------------------+
| Token Economics | Finance / Product | How much are we |
| | Management | spending? What is the |
| | | cost per outcome? |
+-------------------------+----------------------+---------------------------+
| Decision Intelligence | Compliance / QA | Are decisions |
| | | compliant? What is |
| | | the confidence trend? |
+-------------------------+----------------------+---------------------------+
| Memory Performance | ML Engineers / | Is retrieval quality |
| | Data Engineers | adequate? Cache |
| | | efficiency? |
+-------------------------+----------------------+---------------------------+
5.2 Fleet Overview Dashboard
The Fleet Overview dashboard provides a single-pane-of-glass view of the entire agent fleet. It includes panels for fleet-wide error rates, task throughput, active agent count, autonomy level distribution, and policy compliance status:
{
  "dashboard": {
    "title": "Agent Fleet Overview",
    "uid": "agent-fleet-overview",
    "tags": ["agents", "fleet", "overview"],
    "timezone": "browser",
    "refresh": "30s",
    "time": { "from": "now-6h", "to": "now" },
    "templating": {
      "list": [
        {
          "name": "cluster",
          "type": "query",
          "query": "label_values(agent_tasks_total, cluster)",
          "refresh": 2
        }
      ]
    },
    "panels": [
      {
        "title": "Fleet Task Throughput",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
        "targets": [
          {
            "expr": "sum(rate(agent_tasks_total{cluster=\"$cluster\"}[5m])) by (status)",
            "legendFormat": "{{ status }}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "custom": { "drawStyle": "line", "fillOpacity": 20, "stacking": { "mode": "normal" } },
            "unit": "ops"
          }
        }
      },
      {
        "title": "Fleet Error Rate",
        "type": "gauge",
        "gridPos": { "h": 8, "w": 6, "x": 12, "y": 0 },
        "targets": [
          {
            "expr": "1 - (sum(rate(agent_tasks_total{status=\"success\", cluster=\"$cluster\"}[5m])) / sum(rate(agent_tasks_total{cluster=\"$cluster\"}[5m])))",
            "instant": true
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percentunit",
            "thresholds": {
              "steps": [
                { "color": "green", "value": null },
                { "color": "yellow", "value": 0.02 },
                { "color": "red", "value": 0.05 }
              ]
            },
            "max": 0.2
          }
        }
      },
      {
        "title": "Active Agents",
        "type": "stat",
        "gridPos": { "h": 4, "w": 6, "x": 18, "y": 0 },
        "targets": [
          {
            "expr": "count(agent_autonomy_level{cluster=\"$cluster\"} > 0)",
            "instant": true
          }
        ],
        "fieldConfig": { "defaults": { "unit": "short" } }
      },
      {
        "title": "Policy Violations (24h)",
        "type": "stat",
        "gridPos": { "h": 4, "w": 6, "x": 18, "y": 4 },
        "targets": [
          {
            "expr": "sum(increase(agent_policy_violations_total{cluster=\"$cluster\"}[24h]))",
            "instant": true
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "short",
            "thresholds": {
              "steps": [
                { "color": "green", "value": null },
                { "color": "yellow", "value": 1 },
                { "color": "red", "value": 5 }
              ]
            }
          }
        }
      },
      {
        "title": "Autonomy Level Distribution",
        "type": "piechart",
        "gridPos": { "h": 8, "w": 8, "x": 0, "y": 8 },
        "targets": [
          {
            "expr": "count_values(\"level\", agent_autonomy_level{cluster=\"$cluster\"})",
            "legendFormat": "Tier {{ level }}"
          }
        ]
      },
      {
        "title": "Decision Confidence Distribution",
        "type": "histogram",
        "gridPos": { "h": 8, "w": 8, "x": 8, "y": 8 },
        "targets": [
          {
            "expr": "sum(rate(agent_decision_confidence_bucket{cluster=\"$cluster\"}[5m])) by (le)",
            "format": "heatmap"
          }
        ]
      },
      {
        "title": "Top Tools by Usage",
        "type": "bargauge",
        "gridPos": { "h": 8, "w": 8, "x": 16, "y": 8 },
        "targets": [
          {
            "expr": "topk(10, sum(rate(agent_tool_calls_total{cluster=\"$cluster\"}[1h])) by (tool_name))",
            "legendFormat": "{{ tool_name }}"
          }
        ]
      }
    ]
  }
}
5.3 Token Economics Dashboard
The Token Economics dashboard tracks LLM API costs across the fleet, enabling financial oversight and cost optimization:
{
  "dashboard": {
    "title": "Agent Token Economics",
    "uid": "agent-token-economics",
    "panels": [
      {
        "title": "Hourly Token Cost (USD)",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 16, "x": 0, "y": 0 },
        "targets": [
          {
            "expr": "sum(rate(agent_token_cost_dollars[5m])) by (provider) * 3600",
            "legendFormat": "{{ provider }}"
          }
        ],
        "fieldConfig": {
          "defaults": { "unit": "currencyUSD", "custom": { "fillOpacity": 30 } }
        }
      },
      {
        "title": "Cost Per Successful Task",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 8, "x": 16, "y": 0 },
        "targets": [
          { "expr": "agent:cost_per_task:5m", "legendFormat": "{{ agent_id }}" }
        ],
        "fieldConfig": { "defaults": { "unit": "currencyUSD" } }
      },
      {
        "title": "Token Usage by Model",
        "type": "piechart",
        "gridPos": { "h": 8, "w": 8, "x": 0, "y": 8 },
        "targets": [
          {
            "expr": "sum(increase(agent_token_usage_total[24h])) by (model)",
            "legendFormat": "{{ model }}"
          }
        ]
      },
      {
        "title": "Tokens Per Task Trend",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 16, "x": 8, "y": 8 },
        "targets": [
          { "expr": "agent:tokens_per_task:5m", "legendFormat": "{{ agent_id }}" }
        ],
        "fieldConfig": { "defaults": { "unit": "short" } }
      }
    ]
  }
}
5.4 Decision Intelligence Dashboard
The Decision Intelligence dashboard provides compliance officers and QA teams with visibility into the quality and compliance of agent decisions. Key panels include:
- Decision confidence trend: line chart showing the rolling average of agent:avg_decision_confidence:5m by agent.
- Policy violation timeline: annotation overlay marking each policy violation event with details.
- Decision type distribution: pie chart breaking down decisions by type.
- Alternatives analysis: table of recent decisions with the number of alternatives considered and the rejection reasons.
- Explainability drill-down: links to the full decision audit trail for any selected decision.
5.5 Memory Performance Dashboard
The Memory Performance dashboard monitors the quality and efficiency of the agent's knowledge retrieval system. Panels track retrieval latency percentiles (histogram_quantile(0.50|0.95|0.99, rate(agent_memory_retrieval_latency_seconds_bucket[5m]))), cache hit rates (agent:memory_hit_rate:5m), relevance score distributions (histogram of agent_memory_retrieval_relevance), and query volume trends. This dashboard is critical for ML engineers tuning embedding models and index configurations.
6. Logging Architecture
6.1 Structured Logging for Agents
Agent logs must be structured (JSON), correlatable (linked to traces via trace IDs), and PII-safe (sensitive data redacted before storage). The following logging architecture ensures all three properties:
import { trace } from '@opentelemetry/api';

interface AgentLogEntry {
  timestamp: string;
  level: 'debug' | 'info' | 'warn' | 'error' | 'fatal';
  message: string;

  // Correlation
  traceId: string;
  spanId: string;
  agentId: string;
  taskId: string;

  // Structured fields
  component: string; // 'planner' | 'memory' | 'decision' | 'tool' | 'policy'
  event: string;     // 'task.started' | 'decision.made' | 'tool.invoked' etc.
  metadata: Record<string, unknown>;

  // Environment
  hostname: string;
  service: string;
  version: string;
  environment: string; // 'production' | 'staging' | 'development'
}

export class AgentLogger {
  private agentId: string;
  private service: string;

  constructor(agentId: string, service: string) {
    this.agentId = agentId;
    this.service = service;
  }

  log(
    level: AgentLogEntry['level'],
    message: string,
    component: string,
    event: string,
    metadata: Record<string, unknown> = {}
  ): void {
    const activeSpan = trace.getActiveSpan();
    const spanContext = activeSpan?.spanContext();

    const entry: AgentLogEntry = {
      timestamp: new Date().toISOString(),
      level,
      message,
      traceId: spanContext?.traceId || 'no-trace',
      spanId: spanContext?.spanId || 'no-span',
      agentId: this.agentId,
      taskId: (metadata['taskId'] as string) || 'unknown',
      component,
      event,
      metadata: this.redactPII(metadata),
      hostname: process.env.HOSTNAME || 'unknown',
      service: this.service,
      version: process.env.APP_VERSION || 'unknown',
      environment: process.env.NODE_ENV || 'development',
    };

    // Output as JSON for log aggregator consumption
    process.stdout.write(JSON.stringify(entry) + '\n');
  }

  /**
   * Redacts PII from log metadata.
   * Uses pattern matching to identify and mask sensitive fields.
   */
  private redactPII(
    metadata: Record<string, unknown>
  ): Record<string, unknown> {
    const piiPatterns: Record<string, RegExp> = {
      email: /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g,
      phone: /\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g,
      ssn: /\b\d{3}-?\d{2}-?\d{4}\b/g,
      creditCard: /\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b/g,
      ipAddress: /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/g,
    };

    const redacted = { ...metadata };
    const sensitiveKeys = [
      'password', 'token', 'secret', 'apiKey', 'api_key',
      'authorization', 'credential', 'ssn', 'creditCard',
    ];

    for (const [key, value] of Object.entries(redacted)) {
      if (sensitiveKeys.some(sk => key.toLowerCase().includes(sk))) {
        redacted[key] = '[REDACTED]';
        continue;
      }
      if (typeof value === 'string') {
        let redactedValue = value;
        for (const pattern of Object.values(piiPatterns)) {
          redactedValue = redactedValue.replace(pattern, '[PII_REDACTED]');
        }
        redacted[key] = redactedValue;
      }
    }
    return redacted;
  }
}
6.2 Log Aggregation with Loki
Grafana Loki provides a cost-effective log aggregation solution optimized for label-based querying rather than full-text indexing. For agent systems, the following label strategy balances query flexibility with index efficiency:
# loki-config.yaml
auth_enabled: false

server:
  http_listen_port: 3100

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: loki_index_
        period: 24h

limits_config:
  retention_period: 720h # 30 days hot
  max_query_length: 721h
  max_query_parallelism: 32
  max_streams_per_user: 10000
  ingestion_rate_mb: 16
  ingestion_burst_size_mb: 32
  per_stream_rate_limit: 5MB

# Label strategy for agent logs
# Labels (indexed): level, agent_id, component, environment
# Structured metadata (not indexed): trace_id, task_id, event
LogQL queries for common agent debugging scenarios:
# Find all errors for a specific agent in the last hour
{agent_id="orchestrator-01", level="error"}
# Find decision events with low confidence
{component="decision"} | json | confidence < 0.5
# Correlate logs with a specific trace
{agent_id=~".+"} | json | traceId = "abc123def456"
# Find policy violations across all agents
{component="policy"} |= "violation"
# Token usage anomalies
{component="tool", event="llm.call.completed"} | json | tokens_used > 10000
6.3 PII Redaction Pipeline
PII redaction must occur before logs leave the agent process. The redaction pipeline operates at three levels:
- Field-Level Redaction: Known sensitive field names (password, token, apiKey) are unconditionally redacted.
- Pattern-Level Redaction: Regular expressions detect and mask PII patterns (email addresses, phone numbers, credit card numbers, SSNs) in string values.
- Semantic-Level Redaction: For agent-specific content like chain-of-thought reasoning, an LLM-based classifier identifies and redacts mentions of personal information that may not match standard patterns (names, addresses in natural language).
6.4 Log Storage Estimation
Formula 5: Log Storage Estimation
daily_log_storage = agents * logs_per_hour * avg_log_size_bytes * 24
* (1 - compression_ratio)
For a fleet of 50 agents:
agents = 50
logs_per_hour = 500 (across all components)
avg_log_size_bytes = 512 (JSON structured log)
compression_ratio = 0.7 (Loki's chunked compression)
daily_log_storage = 50 * 500 * 512 * 24 * 0.3
= 92.16 MB/day
monthly_log_storage = 2.77 GB/month
annual_log_storage = 33.6 GB/year (30-day retention: 2.77 GB max hot)
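Formula 5 and its worked example translate directly into a small estimator. The 0.7 compression ratio is the document's assumption for Loki chunk compression; real ratios depend on log content.

```typescript
// Sketch of Formula 5 with the worked example's inputs.

function dailyLogStorageMb(
  agents: number,
  logsPerHour: number,
  avgLogBytes: number,
  compressionRatio: number // fraction of bytes removed by compression
): number {
  const rawBytesPerDay = agents * logsPerHour * avgLogBytes * 24;
  return (rawBytesPerDay * (1 - compressionRatio)) / 1_000_000;
}

console.log(dailyLogStorageMb(50, 500, 512, 0.7)); // 92.16 (MB/day)
```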
7. Kubernetes Observability Stack
7.1 Stack Components
The complete Kubernetes-native observability stack for agent systems comprises five core components, each deployed as a Helm chart within the monitoring namespace:
Diagram 4: Kubernetes Observability Stack Architecture
+------------------------------------------------------------------+
| Kubernetes Cluster |
+------------------------------------------------------------------+
| |
| +-------------------+ +-------------------+ |
| | Agent Pod 1 | | Agent Pod 2 | ... Agent Pod N |
| | [OTEL SDK] | | [OTEL SDK] | |
| +--------|----------+ +--------|----------+ |
| | | |
| v v |
| +--------------------------------------------------+ |
| | OTEL Collector (DaemonSet) | |
| | | |
| | Receivers: OTLP (gRPC:4317, HTTP:4318) | |
| | Processors: Batch, Memory Limiter, Tail Sampling | |
| | Exporters: Tempo, Prometheus, Loki | |
| +--------------|------------|------------|----------+ |
| | | | |
| +-------v----+ +----v------+ +---v--------+ |
| | Tempo | | Prometheus| | Loki | |
| | (Traces) | | (Metrics) | | (Logs) | |
| | | | | | | |
| | Backend: | | TSDB: | | Storage: | |
| | MinIO S3 | | Local SSD | | MinIO S3 | |
| +------\------+ +----\------+ +---\--------+ |
| \ \ \ |
| +-------------+-------------+ |
| | | |
| v | |
| +------------------+ | |
| | Grafana |<----------------+ |
| | | |
| | Dashboards: | |
| | - Fleet Overview | |
| | - Agent Perf | |
| | - Token Economics| |
| | - Decision Intel | |
| | - Memory Perf | |
| +------------------+ |
+------------------------------------------------------------------+
7.2 Resource Requirements
Table 6: Observability Stack Resource Requirements
+-------------------+--------+---------+---------------------------+
| Component | CPU | Memory | Scaling Factor |
+-------------------+--------+---------+---------------------------+
| Prometheus | 2 CPU | 8 Gi | Per 100K active series |
| Tempo | 1 CPU | 4 Gi | Per 50 traces/sec |
| Loki | 1 CPU | 4 Gi | Per 100 GB/day ingestion |
| OTEL Collector | 0.5 CPU| 1 Gi | Per node (DaemonSet) |
| Grafana | 0.5 CPU| 1 Gi | Single instance (HA: x3) |
| Decision Audit DB | 1 CPU | 4 Gi | Per 500K records/day |
+-------------------+--------+---------+---------------------------+
| TOTAL (minimum) | 6 CPU | 22 Gi | 50-agent fleet baseline |
+-------------------+--------+---------+---------------------------+
7.3 Helm Values
The following Helm values configure the kube-prometheus-stack with agent-specific customizations:
# values-agent-observability.yaml

# --- kube-prometheus-stack ---
kube-prometheus-stack:
  prometheus:
    prometheusSpec:
      retention: 30d
      retentionSize: 50GB
      resources:
        requests:
          cpu: 2000m
          memory: 8Gi
        limits:
          cpu: 4000m
          memory: 16Gi
      storageSpec:
        volumeClaimTemplate:
          spec:
            storageClassName: fast-ssd
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 100Gi
      additionalScrapeConfigs:
        - job_name: 'agent-pods'
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: true
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
              action: replace
              target_label: __metrics_path__
              regex: (.+)
      ruleSelector:
        matchLabels:
          role: agent-monitoring
  additionalPrometheusRulesMap:
    agent-recording-rules:
      groups:
        - name: agent_performance_rules
          interval: 30s
          rules:
            - record: agent:task_success_rate:5m
              expr: |
                sum(rate(agent_tasks_total{status="success"}[5m])) by (agent_id)
                /
                sum(rate(agent_tasks_total[5m])) by (agent_id)
  grafana:
    enabled: true
    persistence:
      enabled: true
      size: 10Gi
    dashboardProviders:
      dashboardproviders.yaml:
        apiVersion: 1
        providers:
          - name: 'agent-dashboards'
            orgId: 1
            folder: 'Agent Platform'
            type: file
            disableDeletion: false
            editable: true
            options:
              path: /var/lib/grafana/dashboards/agents
    datasources:
      datasources.yaml:
        apiVersion: 1
        datasources:
          - name: Tempo
            type: tempo
            url: http://tempo:3100
            access: proxy
            jsonData:
              tracesToLogsV2:
                datasourceUid: loki
                filterByTraceID: true
              tracesToMetrics:
                datasourceUid: prometheus
          - name: Loki
            type: loki
            url: http://loki:3100
            access: proxy
            jsonData:
              derivedFields:
                - datasourceUid: tempo
                  matcherRegex: "traceId=(\\w+)"
                  name: TraceID
                  url: '$${__value.raw}'

# --- Tempo (Traces) ---
tempo:
  tempo:
    resources:
      requests:
        cpu: 1000m
        memory: 4Gi
      limits:
        cpu: 2000m
        memory: 8Gi
    storage:
      trace:
        backend: s3
        s3:
          bucket: agent-traces
          endpoint: minio:9000
          access_key: ${MINIO_ACCESS_KEY}
          secret_key: ${MINIO_SECRET_KEY}
          insecure: true
    retention:
      max_duration: 720h  # 30 days
    search:
      max_duration: 720h

# --- Loki (Logs) ---
loki:
  loki:
    auth_enabled: false
    storage:
      type: s3
      s3:
        endpoint: minio:9000
        bucketnames: agent-logs
        access_key_id: ${MINIO_ACCESS_KEY}
        secret_access_key: ${MINIO_SECRET_KEY}
        insecure: true
    limits_config:
      retention_period: 720h
      max_streams_per_user: 10000
  resources:
    requests:
      cpu: 1000m
      memory: 4Gi
    limits:
      cpu: 2000m
      memory: 8Gi

# --- OpenTelemetry Collector ---
opentelemetry-collector:
  mode: daemonset
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 1000m
      memory: 2Gi
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: "0.0.0.0:4317"
          http:
            endpoint: "0.0.0.0:4318"
    processors:
      batch:
        timeout: 5s
        send_batch_size: 8192
      memory_limiter:
        check_interval: 5s
        limit_mib: 1536
        spike_limit_mib: 512
    exporters:
      otlp:
        endpoint: "tempo:4317"
        tls:
          insecure: true
      prometheus:
        endpoint: "0.0.0.0:8889"
      loki:
        endpoint: "http://loki:3100/loki/api/v1/push"
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [prometheus]
        logs:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [loki]
8. Anomaly Detection for Agent Systems
8.1 Statistical Anomaly Detection
Agent behavioral anomalies differ from traditional infrastructure anomalies. A spike in CPU usage is straightforward to detect; a gradual decline in decision quality is not. We employ statistical methods calibrated for the unique distributions of agent metrics.
Z-Score Method: The simplest and most interpretable anomaly detection approach. For each metric, we maintain a rolling window of observations and flag values that deviate significantly from the mean:
Formula 6: Z-Score Anomaly Detection
z(x) = (x - mu) / sigma
Where:
x = current observation
mu = rolling mean (typically 1-hour window)
sigma = rolling standard deviation
Decision rule:
IF |z(x)| > 3 THEN anomaly (99.7% confidence for normal distributions)
IF |z(x)| > 2.5 THEN warning (98.8% confidence)
For agent-specific metrics, we apply z-score detection to: decision confidence (detecting confidence drops), token usage per task (detecting efficiency degradation), memory retrieval latency (detecting backend issues), and tool success rate (detecting tool failures).
interface AnomalyDetector {
  windowSize: number;      // Rolling window in observations
  observations: number[];
  thresholdSigma: number;  // Default: 3.0

  /**
   * Adds an observation and returns anomaly assessment.
   */
  observe(value: number): {
    isAnomaly: boolean;
    zScore: number;
    mean: number;
    stdDev: number;
    direction: 'high' | 'low' | 'normal';
  };
}

export class RollingZScoreDetector implements AnomalyDetector {
  windowSize: number;
  observations: number[] = [];
  thresholdSigma: number;

  constructor(windowSize: number = 100, thresholdSigma: number = 3.0) {
    this.windowSize = windowSize;
    this.thresholdSigma = thresholdSigma;
  }

  observe(value: number): {
    isAnomaly: boolean;
    zScore: number;
    mean: number;
    stdDev: number;
    direction: 'high' | 'low' | 'normal';
  } {
    this.observations.push(value);
    if (this.observations.length > this.windowSize) {
      this.observations.shift();
    }

    // Require a minimum sample before scoring to avoid noisy early flags.
    if (this.observations.length < 10) {
      return { isAnomaly: false, zScore: 0, mean: value, stdDev: 0, direction: 'normal' };
    }

    const mean =
      this.observations.reduce((a, b) => a + b, 0) / this.observations.length;
    const variance =
      this.observations.reduce((sum, x) => sum + Math.pow(x - mean, 2), 0) /
      this.observations.length;
    const stdDev = Math.sqrt(variance);

    if (stdDev === 0) {
      return { isAnomaly: false, zScore: 0, mean, stdDev, direction: 'normal' };
    }

    const zScore = (value - mean) / stdDev;
    const isAnomaly = Math.abs(zScore) > this.thresholdSigma;

    return {
      isAnomaly,
      zScore,
      mean,
      stdDev,
      direction:
        zScore > this.thresholdSigma ? 'high'
        : zScore < -this.thresholdSigma ? 'low'
        : 'normal',
    };
  }
}
8.2 Automated Circuit Breakers
When anomalies are detected, automated circuit breakers protect the system from cascading failures. Agent-specific circuit breakers operate at three levels:
Agent-Level Circuit Breaker: If an individual agent's error rate exceeds the threshold, the circuit breaker transitions the agent from "closed" (operating normally) to "open" (all requests rejected) state. After a configurable timeout, the circuit enters "half-open" state, where a limited number of requests are allowed to determine if the underlying issue has resolved.
Tool-Level Circuit Breaker: If a specific tool's failure rate exceeds the threshold, the circuit breaker prevents agents from invoking that tool, forcing them to select alternative tools or escalate to human operators.
Fleet-Level Circuit Breaker: If the fleet-wide anomaly rate exceeds the threshold, the fleet circuit breaker triggers a global pause, preventing all agents from making autonomous decisions until the situation is assessed.
Diagram 5: Circuit Breaker State Machine
                  error_rate > threshold
 +----------+ ----------------------------> +----------------------+
 |  CLOSED  |                               |         OPEN         |
 | (normal) |                               | (reject all requests)|
 +----------+                               +----------------------+
      ^                                         |             ^
      |                                 timeout expires       |
      | success_rate > recovery                 v             | failure
      |                                  +-------------+      | detected
      +----------------------------------|  HALF-OPEN  |------+
                                         |   (test)    |
                                         +-------------+
8.3 Self-Healing Patterns
Beyond circuit breaking, agent systems can implement self-healing behaviors that automatically remediate common failure modes:
- Memory Index Rebuild: When memory retrieval relevance scores consistently fall below a threshold, the system triggers an automatic re-indexing of the vector database.
- Model Fallback: When the primary LLM provider experiences latency spikes or error rate increases, the system automatically routes requests to a fallback provider.
- Autonomy Degradation: When policy violations are detected, the system automatically reduces the agent's autonomy level, requiring human approval for subsequent decisions.
- Context Window Pruning: When token usage per task exceeds the anomaly threshold, the system automatically prunes low-relevance memories from the context window.
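One way to wire these four patterns together is a dispatcher that maps anomaly categories to remediation actions. The `AnomalyEvent` shape, category names, and handler behaviors below are illustrative assumptions, not a platform API:

```typescript
// Sketch: route detected anomaly categories to the self-healing patterns
// described above. All names here are hypothetical.
type AnomalyKind =
  | 'memory_relevance_low'  // -> Memory Index Rebuild
  | 'provider_degraded'     // -> Model Fallback
  | 'policy_violation'      // -> Autonomy Degradation
  | 'token_usage_high';     // -> Context Window Pruning

interface AnomalyEvent {
  kind: AnomalyKind;
  agentId: string;
  zScore: number;
}

type Remediation = (e: AnomalyEvent) => string;

const remediations: Record<AnomalyKind, Remediation> = {
  memory_relevance_low: (e) => `reindex vector DB for ${e.agentId}`,
  provider_degraded: (e) => `route ${e.agentId} to fallback provider`,
  policy_violation: (e) => `lower autonomy level for ${e.agentId}`,
  token_usage_high: (e) => `prune context window for ${e.agentId}`,
};

function selfHeal(event: AnomalyEvent): string {
  // In a real system this would enqueue the action and audit the decision.
  return remediations[event.kind](event);
}
```

In production, each remediation would also emit a Decision Audit Record so that automated interventions remain traceable.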
9. Cost of Observability
9.1 Total Cost Model
Observability infrastructure is not free. For agent systems, the cost of observability can represent a meaningful fraction of the total operational budget, particularly when decision audit trails require long-term retention. The total cost of observability is modeled as:
Formula 7: Total Observability Cost
obs_cost = trace_cost + metric_cost + log_cost + audit_cost + compute_cost
Where:
trace_cost = daily_trace_volume * storage_cost_per_GB * retention_days / 30
metric_cost = active_series * cost_per_series_per_month
log_cost = daily_log_volume * storage_cost_per_GB * retention_days / 30
audit_cost = daily_audit_volume * tiered_storage_cost * retention_years
compute_cost = sum(component_cpu * cpu_cost + component_mem * mem_cost)
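As a worked instance of the compute_cost term, the following sketch applies the unit prices used later in Table 7 ($0.05/CPU-hr, $0.007/Gi-hr) over an assumed 720-hour (30-day) month to the component sizes from Table 6:

```typescript
// Worked instance of Formula 7's compute_cost term. Unit prices come from
// Table 7; the 720-hour month is an assumption of this sketch.
const CPU_HR = 0.05;     // $/CPU-hour
const MEM_GI_HR = 0.007; // $/Gi-hour
const HOURS = 720;       // 30-day month

const components: Array<[name: string, cpu: number, memGi: number]> = [
  ['prometheus', 2.0, 8],
  ['tempo', 1.0, 4],
  ['loki', 1.0, 4],
  ['otel-collector-x3', 1.5, 3],
  ['grafana', 0.5, 1],
  ['audit-db', 1.0, 4],
];

const computeCost = components.reduce(
  (sum, [, cpu, mem]) => sum + (cpu * CPU_HR + mem * MEM_GI_HR) * HOURS,
  0
);
// ~$373/month of compute; with ~$3.4 of storage this lands near the ~$376
// fleet total discussed below.
```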
9.2 Cost Breakdown for a 50-Agent Fleet
Table 7: Monthly Observability Cost (50-Agent Fleet, Self-Hosted)
+-------------------------+-------------+------------------+------------------+
| Component               | Volume      | Unit Cost        | Monthly Cost     |
+-------------------------+-------------+------------------+------------------+
| Traces (Tempo + MinIO)  | 127.5 GB    | $0.023/GB        | $2.93            |
| Metrics (Prometheus)    | 45K series  | Self-hosted      | $0 (compute only)|
| Logs (Loki + MinIO)     | 2.77 GB     | $0.023/GB        | $0.06            |
| Decision Audit (PG+S3)  | 14 GB hot   | $0.023/GB hot    | $0.44            |
|                         | + cold tier | $0.004/GB cold   |                  |
+-------------------------+-------------+------------------+------------------+
| Compute (K8s, 720 hr/mo)|             | $0.05/CPU-hr,    |                  |
|                         |             | $0.007/Gi-hr     |                  |
|   Prometheus            | 2 CPU, 8 Gi |                  | $112.32          |
|   Tempo                 | 1 CPU, 4 Gi |                  | $56.16           |
|   Loki                  | 1 CPU, 4 Gi |                  | $56.16           |
|   OTEL Collector (x3)   | 1.5 CPU,3 Gi|                  | $69.12           |
|   Grafana               | 0.5 CPU,1 Gi|                  | $23.04           |
|   Audit DB              | 1 CPU, 4 Gi |                  | $56.16           |
+-------------------------+-------------+------------------+------------------+
| TOTAL MONTHLY           |             |                  | ~$376/month      |
+-------------------------+-------------+------------------+------------------+
| Per Agent Per Month     |             |                  | ~$7.52/agent     |
+-------------------------+-------------+------------------+------------------+
9.3 Cost Optimization Strategies
- Aggressive Sampling: Reduce trace sampling from 10% to 5% for stable agents, saving approximately 50% on trace storage costs.
- Metric Aggregation: Use recording rules to pre-aggregate fleet-level metrics, allowing shorter retention of raw per-agent series.
- Log Level Tuning: In production, set log level to "warn" for stable components, reducing log volume by 60-80%.
- Cold Storage Tiering: Move decision audit records older than 30 days to S3 Glacier or equivalent, reducing storage costs by 80%.
- Dynamic Sampling: Increase sampling rates only when anomalies are detected, keeping baseline costs low while maintaining visibility during incidents.
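The dynamic-sampling strategy can be sketched as a rate function that holds a boosted trace-sampling rate for a window after the most recent anomaly, then falls back to the cheap baseline. The specific rates and hold window below are illustrative assumptions:

```typescript
// Dynamic sampling sketch: boost the trace sampling rate while anomalies
// are recent, otherwise use the low baseline. Parameters are illustrative.
function samplingRate(
  baselineRate: number,   // e.g. 0.05 (5%) during normal operation
  boostedRate: number,    // e.g. 0.50 (50%) during incidents
  lastAnomalyMs: number,  // timestamp of the most recent detected anomaly
  nowMs: number,
  holdMs: number = 15 * 60 * 1000  // keep boosted sampling for 15 minutes
): number {
  return nowMs - lastAnomalyMs < holdMs ? boostedRate : baselineRate;
}
```

Feeding this rate into the collector's sampling configuration keeps baseline trace costs low while preserving full visibility during incidents.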
9.4 Observability Cost as Percentage of Total Agent Cost
For context, if a 50-agent fleet consumes approximately $5,000/month in LLM API costs (the primary operational expense), the observability infrastructure at $376/month represents approximately 7.5% of total operating cost. This is within the generally accepted 5-15% range for observability overhead in production systems. Organizations should target keeping observability costs below 10% of the total agent operational budget.
10. Implementation Roadmap
Phase 1: Foundation (Weeks 1-4)
Deploy the core observability stack (Prometheus, Loki, Grafana) with basic agent metrics. Instrument the top 3 most critical agent types with OpenTelemetry custom spans. Deploy the OTEL Collector with head-based sampling at 10%.
Phase 2: Decision Auditing (Weeks 5-8)
Implement the Decision Audit Record schema and append-only storage backend. Deploy the audit pipeline with PII redaction. Create the Decision Intelligence dashboard. Implement basic z-score anomaly detection for decision confidence.
Phase 3: Advanced Tracing (Weeks 9-12)
Enable tail-based sampling with priority scoring. Deploy Tempo for distributed trace storage and search. Implement trace context propagation across all inter-agent communication paths. Create the Fleet Overview and Token Economics dashboards.
Phase 4: Intelligence (Weeks 13-16)
Deploy automated circuit breakers at agent, tool, and fleet levels. Implement self-healing patterns (memory rebuild, model fallback, autonomy degradation). Add explainability reports (feature attribution, counterfactual, contrastive). Optimize costs through dynamic sampling and storage tiering.
Diagram 6: Implementation Phasing
Phase 1 (Wk 1-4) Phase 2 (Wk 5-8) Phase 3 (Wk 9-12) Phase 4 (Wk 13-16)
+-----------------+ +-----------------+ +-----------------+ +-----------------+
| FOUNDATION | | DECISION AUDIT | | ADVANCED TRACE | | INTELLIGENCE |
| | | | | | | |
| - Prometheus |---->| - Audit Schema |---->| - Tail Sampling |---->| - Circuit Break |
| - Loki | | - Append Store | | - Tempo Deploy | | - Self-Healing |
| - Grafana | | - PII Redaction | | - Context Prop | | - Explainability|
| - Basic Metrics | | - Decision Dash | | - Fleet Dash | | - Cost Optimize |
| - OTEL SDK | | - Z-Score Detect| | - Token Dash | | - Dynamic Sample|
+-----------------+ +-----------------+ +-----------------+ +-----------------+
Maturity: L1 Basic L2 Structured L3 Correlated L4 Intelligent
11. Conclusion
Agent observability represents a paradigm shift from traditional APM. Where conventional monitoring asks "is the system healthy?", agent observability asks "is the system making good decisions?" This distinction has profound implications for instrumentation strategy, storage architecture, and operational workflows.
The framework presented in this whitepaper addresses these challenges through four integrated pillars: structured logs with PII redaction and correlation IDs, custom Prometheus metrics with cardinality-aware design, OpenTelemetry traces with agent-specific semantic conventions and priority-based sampling, and Decision Audit Trails that satisfy both regulatory compliance and operational learning requirements.
Key takeaways for engineering teams implementing agent observability:
- Instrument the cognitive pipeline, not just the API layer. Traditional APM instrumentation at service boundaries misses the most important signals in agent systems. Custom spans for planning, deciding, memory retrieval, and policy checking provide the visibility needed to debug and optimize agent behavior.
- Design for explainability from day one. Retrofitting decision audit trails into an existing agent system is vastly more expensive than building them into the initial architecture. The Decision Audit Record schema provides a structured foundation that supports both real-time monitoring and post-hoc analysis.
- Manage cardinality proactively. Agent systems naturally generate high-cardinality metric labels (unique agent IDs, task IDs, tool combinations). Without proactive cardinality budgeting, Prometheus performance will degrade rapidly.
- Budget 7-10% of operational cost for observability. For a self-hosted Kubernetes stack monitoring a 50-agent fleet, expect approximately $376/month in infrastructure costs, or roughly $7.52 per agent per month.
- Implement anomaly detection before you need it. Agent behavioral anomalies are subtle and can go undetected for extended periods. Rolling z-score detectors with automated circuit breakers provide a safety net that prevents cascading failures.
The transition from monitoring to observability to operational intelligence is not merely a tooling upgrade -- it is a fundamental change in how we understand and manage autonomous systems. As agent fleets grow in scale and sophistication, the observability infrastructure described in this whitepaper will become as essential as the agents themselves.
References
- Sridharan, C. (2018). Distributed Systems Observability. O'Reilly Media. A foundational text establishing the three pillars of observability (logs, metrics, traces) for distributed systems.
- Doshi-Velez, F., & Kim, B. (2017). "Towards A Rigorous Science of Interpretable Machine Learning." arXiv:1702.08608. Provides the theoretical framework for feature attribution, counterfactual, and contrastive explanations referenced in Section 4.
- European Parliament. (2024). "Regulation (EU) 2024/1689 of the European Parliament and of the Council laying down harmonised rules on artificial intelligence (Artificial Intelligence Act)." Official Journal of the European Union. The complete legal text establishing AI transparency and logging requirements.
- OpenTelemetry Authors. (2024). "OpenTelemetry Specification v1.30." opentelemetry.io. The canonical specification for telemetry collection, propagation, and export used throughout this paper.
- Google SRE Team. (2016). Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media. Establishes SLI/SLO frameworks and error budgets referenced in the alerting section.
- Beyer, B., Murphy, N.R., Rensin, D.K., Kawahara, K., & Thorne, S. (2018). The Site Reliability Workbook. O'Reilly Media. Practical implementation guidance for monitoring and alerting systems.
- Prometheus Authors. (2024). "Prometheus Documentation: Recording Rules." prometheus.io. Technical reference for recording rule configuration and PromQL query optimization.
- Grafana Labs. (2024). "Grafana Tempo Documentation." grafana.com/docs/tempo. Architecture and configuration reference for the distributed tracing backend used in our stack.
- Grafana Labs. (2024). "Grafana Loki Documentation." grafana.com/docs/loki. Log aggregation system documentation including LogQL query language reference.
- W3C. (2024). "Trace Context - W3C Recommendation." w3.org/TR/trace-context. The W3C standard for distributed trace context propagation used for inter-agent tracing.
- Nygard, M. (2018). Release It! Design and Deploy Production-Ready Software. 2nd Edition. Pragmatic Bookshelf. ISBN: 978-1680502398. Introduces the circuit breaker pattern adapted for agent systems in Section 8.
- Majors, C., Fong-Jones, L., & Miranda, G. (2022). Observability Engineering. O'Reilly Media. ISBN: 978-1492076445. Modern observability practices including tail-based sampling strategies.
- Burns, B. (2018). Designing Distributed Systems. O'Reilly Media. ISBN: 978-1491983645. Patterns for distributed system design including sidecar and ambassador patterns used in collector deployment.
- Kleppmann, M. (2017). Designing Data-Intensive Applications. O'Reilly Media. ISBN: 978-1449373320. Foundational reference for append-only storage, event sourcing, and data integrity verification.
- Fowler, M. (2014). "Circuit Breaker." martinfowler.com. Original description of the circuit breaker pattern adapted for agent-level, tool-level, and fleet-level application.
- Ribeiro, M.T., Singh, S., & Guestrin, C. (2016). "'Why Should I Trust You?': Explaining the Predictions of Any Classifier." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. DOI:10.1145/2939672.2939778. LIME explanations framework informing our explainability approach.
- Lundberg, S.M., & Lee, S.I. (2017). "A Unified Approach to Interpreting Model Predictions." Advances in Neural Information Processing Systems 30. arXiv:1705.07874. SHAP values for feature attribution referenced in the explainability framework.
- CNCF. (2024). "Cloud Native Computing Foundation Landscape: Observability and Analysis." landscape.cncf.io. Overview of the cloud-native observability ecosystem and tool categorization.
- Kubernetes Authors. (2024). "Kubernetes Documentation: Monitoring, Logging, and Debugging." kubernetes.io. Reference for Kubernetes-native monitoring patterns and resource specifications.
- HashiCorp. (2024). "Consul Service Mesh: Observability." consul.io. Service mesh observability patterns informing the agent mesh monitoring approach.
- NIST. (2023). "Artificial Intelligence Risk Management Framework (AI RMF 1.0)." US framework for AI risk management including monitoring and audit trail requirements.
- ISO/IEC. (2023). "ISO/IEC 42001:2023 - Artificial intelligence -- Management system." International standard for AI management systems including observability requirements.
- Li, B., Qi, P., Liu, B., Di, S., Liu, J., Pei, J., Yi, J., & Zhou, B. (2023). "Trustworthy AI: From Principles to Practices." ACM Computing Surveys. DOI:10.1145/3555803. Comprehensive survey of trustworthy AI practices including transparency and explainability.
- Sculley, D., Holt, G., Golovin, D., et al. (2015). "Hidden Technical Debt in Machine Learning Systems." Advances in Neural Information Processing Systems 28. Seminal paper on ML system monitoring and technical debt, applicable to agent observability.
- Breck, E., Cai, S., Nielsen, E., Salib, M., & Sculley, D. (2017). "The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction." IEEE International Conference on Big Data. DOI:10.1109/BigData.2017.8258038. Testing and monitoring rubric adapted for agent system readiness assessment.
- Paleyes, A., Urma, R.G., & Lawrence, N.D. (2022). "Challenges in Deploying Machine Learning: A Survey of Case Studies." ACM Computing Surveys. arXiv:2011.09926. DOI:10.1145/3533378. Survey of deployment challenges including observability gaps in production ML systems.
- OpenTelemetry Authors. (2024). "OpenTelemetry Collector Contrib: Tail Sampling Processor." GitHub repository documentation. Implementation reference for tail-based sampling policies used in Section 2.
This whitepaper is part of the Bluefly Agent Platform Technical Series. For related topics, see Whitepaper 07 (Agent Security) and Whitepaper 09 (Agent Economics and Cost Optimization).
Copyright 2026 Bluefly Platform Engineering. All rights reserved.