Agent Tracer - AI Operations Intelligence

Agent Tracer is the unified observability platform for AI agents and LLM workflows across the LLM platform. It provides comprehensive tracing, metrics, and analytics for autonomous agent operations.


Overview

Agent Tracer combines AI-specific observability (Phoenix Arize), distributed tracing (Jaeger/Tempo), correlation analysis (Neo4j), and intelligent performance tracking into a single platform.

Key Capabilities

  • AI-Specific Tracing: LLM call tracking, token usage, prompt analysis
  • Distributed Tracing: OpenTelemetry-based cross-service tracing
  • Correlation Engine: Neo4j-powered relationship discovery
  • ACE: AI Capabilities Engine for performance scoring
  • ATLAS: Agent Tracing and Learning Analytics System
  • Metrics: Prometheus/Grafana integration
  • Alerting: Proactive issue detection

Architecture

    graph TB
        subgraph "AI Agents"
            A1[TDD Enforcer]
            A2[API Builder]
            A3[Doc Sync]
            A4[Security Audit]
        end

        subgraph "Agent Tracer Core"
            AT[Agent Tracer]
            ACE[ACE Engine]
            ATLAS[ATLAS Analytics]
        end

        subgraph "Observability Stack"
            PHX[Phoenix Arize]
            JAE[Jaeger]
            NEO[Neo4j]
            PROM[Prometheus]
        end

        subgraph "Visualization"
            GRA[Grafana]
            UI[Phoenix UI]
        end

        A1 --> AT
        A2 --> AT
        A3 --> AT
        A4 --> AT

        AT --> PHX
        AT --> JAE
        AT --> NEO
        AT --> PROM

        ACE --> AT
        ATLAS --> AT

        PHX --> UI
        PROM --> GRA
        JAE --> GRA

Installation

NPM Package

    # Install agent-tracer
    npm install @bluefly/agent-tracer

    # Or globally
    npm install -g @bluefly/agent-tracer

    # Verify installation
    agent-tracer --version

Docker Stack

    # Clone repository
    git clone https://gitlab.com/blueflyio/agent-platform/agent-tracer.git
    cd agent-tracer

    # Start observability stack
    docker-compose up -d

    # Verify services
    docker-compose ps

    # Services started:
    # - Phoenix Arize: http://localhost:6006
    # - Jaeger UI: http://localhost:16686
    # - Neo4j Browser: http://localhost:7474
    # - Prometheus: http://localhost:9090
    # - Grafana: http://localhost:3000
    # - ACE Server: http://localhost:3008
    # - ATLAS Server: http://localhost:3009

Kubernetes Deployment

    # Deploy via Helm
    helm install agent-tracer ./infrastructure/helm/agent-tracer \
      --namespace observability \
      --create-namespace

    # Verify deployment
    kubectl get pods -n observability

    # Access services
    kubectl port-forward -n observability svc/phoenix 6006:6006
    kubectl port-forward -n observability svc/jaeger 16686:16686

Configuration

Environment Variables

    # Service Ports
    AGENT_TRACER_PORT=3007
    AGENT_TRACER_ACE_PORT=3008
    AGENT_TRACER_ATLAS_PORT=3009

    # Phoenix Arize
    PHOENIX_COLLECTOR_ENDPOINT=http://localhost:6006
    PHOENIX_PROJECT=llm-agents

    # Jaeger
    JAEGER_AGENT_HOST=localhost
    JAEGER_AGENT_PORT=6831
    JAEGER_COLLECTOR_ENDPOINT=http://localhost:14268/api/traces

    # Tempo
    TEMPO_ENDPOINT=http://localhost:4317

    # Prometheus
    PROMETHEUS_PUSHGATEWAY=http://localhost:9091

    # Neo4j (Correlation)
    NEO4J_URI=bolt://localhost:7687
    NEO4J_USER=neo4j
    NEO4J_PASSWORD=password

    # Qdrant (Vector Storage)
    QDRANT_URL=http://localhost:6333
    QDRANT_API_KEY=your-api-key

    # Loki (Logging)
    LOKI_URL=http://localhost:3100

    # Alertmanager
    ALERTMANAGER_URL=http://localhost:9093
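As a rough illustration of how an agent process might read these variables, the sketch below parses a few of them into a typed object. The `TracerEnvConfig` shape and the `loadTracerEnv` and `readPort` helpers are hypothetical and not part of `@bluefly/agent-tracer`. Treating the values listed above as fallbacks when a variable is unset is also an assumption.

```typescript
// Hypothetical config shape for illustration; not the package's own API.
interface TracerEnvConfig {
  tracerPort: number;
  acePort: number;
  atlasPort: number;
  phoenixEndpoint: string;
  neo4jUri: string;
}

type Env = Record<string, string | undefined>;

// Parse a port variable; fall back to the value shown in the listing when unset.
function readPort(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw === "") return fallback;
  const port = Number(raw);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`${name} must be a port between 1 and 65535, got "${raw}"`);
  }
  return port;
}

// Build the config from an environment map (pass process.env in Node).
function loadTracerEnv(env: Env): TracerEnvConfig {
  return {
    tracerPort: readPort(env, "AGENT_TRACER_PORT", 3007),
    acePort: readPort(env, "AGENT_TRACER_ACE_PORT", 3008),
    atlasPort: readPort(env, "AGENT_TRACER_ATLAS_PORT", 3009),
    phoenixEndpoint: env["PHOENIX_COLLECTOR_ENDPOINT"] ?? "http://localhost:6006",
    neo4jUri: env["NEO4J_URI"] ?? "bolt://localhost:7687",
  };
}
```

Validating ports at startup surfaces a mistyped variable immediately rather than as a connection failure later.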

Tracer Configuration

config/tracer.yaml:

    tracer:
      serviceName: agent-tracer
      enabled: true
      samplingRate: 1.0

      exporters:
        phoenix:
          enabled: true
          endpoint: http://localhost:6006
        jaeger:
          enabled: true
          endpoint: http://localhost:14268/api/traces
        tempo:
          enabled: true
          endpoint: http://localhost:4317

    metrics:
      enabled: true
      prometheus:
        port: 9090
        path: /metrics

    logging:
      level: info
      format: json
      loki:
        enabled: true
        endpoint: http://localhost:3100
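To make the meaning of `samplingRate` concrete, here is a minimal sketch of ratio-based sampling keyed on the trace ID, so that every span in a trace gets the same decision. Whether Agent Tracer samples this way is not stated in this document. `shouldSample` is a hypothetical helper, and the trace ID is assumed to be a hex string.

```typescript
// Hypothetical helper: decide whether to keep a trace for a given sampling rate.
// Assumes traceId is a hex string (for example, a 32-character OpenTelemetry trace ID).
function shouldSample(traceId: string, samplingRate: number): boolean {
  if (samplingRate >= 1) return true;  // 1.0 keeps every trace
  if (samplingRate <= 0) return false; // 0 drops every trace
  // Interpret the low 32 bits of the trace ID as a value in [0, 2^32).
  const low32 = parseInt(traceId.slice(-8), 16);
  // Keep the trace when that value falls below the rate's share of the range.
  return low32 < samplingRate * 0x100000000;
}
```

With `samplingRate: 1.0`, as in the configuration above, every trace is kept. Lower rates trade completeness for storage and export cost.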

Usage

Initialize Tracing in Agent

    import { AgentTracer } from '@bluefly/agent-tracer'
    // SpanStatusCode is assumed to be the OpenTelemetry enum
    import { SpanStatusCode } from '@opentelemetry/api'

    // Initialize tracer
    const tracer = new AgentTracer({
      serviceName: 'tdd-enforcer',
      phoenixEndpoint: 'http://localhost:6006',
      jaegerEndpoint: 'http://localhost:14268/api/traces',
      neo4jUri: 'bolt://localhost:7687'
    })

    // Wrapped in an async function so that `await` and `return` are valid
    async function runEnforceTdd() {
      // Start span for agent task
      const span = tracer.startSpan('enforce-tdd', {
        taskId: 'task-123',
        agentId: 'tdd-enforcer-001',
        projectId: 'llm-platform'
      })

      try {
        // Execute agent task (enforceTDD is the agent's own task function)
        const result = await enforceTDD()

        span.setAttribute('tests.created', result.testsCreated)
        span.setAttribute('coverage.percentage', result.coverage)
        span.setStatus({ code: SpanStatusCode.OK })

        return result
      } catch (error) {
        // Record the failure on the span, then rethrow to the caller
        span.recordException(error as Error)
        span.setStatus({ code: SpanStatusCode.ERROR })
        throw error
      } finally {
        // Always end the span, on success or failure
        span.end()
      }
    }

Trace LLM Calls

    import { PhoenixTracer } from '@bluefly/agent-tracer/integrations/phoenix'

    const phoenix = new PhoenixTracer({
      endpoint: 'http://localhost:6006',
      project: 'llm-agents'
    })

    // Trace Claude API call
    const result = await phoenix.traceLLMCall({
      model: 'claude-sonnet-4-5-20250929',
      prompt: 'Generate tests for AuthService.ts',
      provider: 'anthropic',
      metadata: {
        agentId: 'tdd-enforcer-001',
        taskType: 'test-generation'
      }
    })

    console.log('Tokens:', result.usage.totalTokens)
    console.log('Cost:', result.cost)
    console.log('Latency:', result.latency)

Record Metrics

    import { metrics } from '@bluefly/agent-tracer'

    // Counter
    metrics.counter('agent.tasks.completed', {
      agentId: 'api-builder',
      status: 'success'
    })

    // Gauge
    metrics.gauge('agent.queue.size', 42, {
      agentId: 'doc-sync'
    })

    // Histogram
    metrics.histogram('agent.task.duration', 1250, {
      agentId: 'security-audit',
      taskType: 'vulnerability-scan'
    })

CloudEvents for Activity Streams

OSSA activity streams follow the CloudEvents specification for interoperability with event-driven systems (SigNoz, Kafka, EventBridge).

    import { CloudEventEmitter } from '@bluefly/agent-tracer/cloudevents'

    const emitter = new CloudEventEmitter({
      source: '/agents/review-agent/instance-123',
      type_prefix: 'io.ossa.agent'
    })

    // Emit interaction completed event
    await emitter.emit({
      type: 'io.ossa.agent.interaction.completed',
      data: {
        agent_id: 'review-agent',
        instance_id: 'uuid-instance',
        session_id: 'uuid-session',
        interaction_id: 'uuid-interaction',
        model: 'claude-sonnet-4-20250514',
        input_tokens: 1523,
        output_tokens: 892,
        latency_ms: 1250,
        finish_reason: 'stop',
        capabilities_used: ['code_review', 'security_analysis']
      }
    })

CloudEvents Wire Format:

    {
      "specversion": "1.0",
      "type": "io.ossa.agent.interaction.completed",
      "source": "/agents/review-agent/instance-123",
      "id": "550e8400-e29b-41d4-a716-446655440000",
      "time": "2025-12-04T10:30:00.000Z",
      "datacontenttype": "application/json",
      "subject": "session/uuid-session/interaction/uuid-interaction",
      "data": {
        "agent_id": "review-agent",
        "instance_id": "uuid-instance",
        "session_id": "uuid-session",
        "interaction_id": "uuid-interaction",
        "model": "claude-sonnet-4-20250514",
        "input_tokens": 1523,
        "output_tokens": 892,
        "latency_ms": 1250,
        "finish_reason": "stop",
        "capabilities_used": ["code_review", "security_analysis"]
      }
    }
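The mapping from an emitted payload to this wire format can be sketched as a small envelope builder. `toCloudEvent` and the two interfaces are hypothetical names used for illustration. Deriving `subject` from `session_id` and `interaction_id` is inferred from the sample above and is not a documented rule.

```typescript
// Minimal payload fields needed to derive the subject; other fields pass through.
interface InteractionData {
  session_id: string;
  interaction_id: string;
  [key: string]: unknown;
}

// CloudEvents 1.0 attributes that appear in the wire-format sample.
interface CloudEventEnvelope<T> {
  specversion: "1.0";
  type: string;
  source: string;
  id: string;
  time: string;
  datacontenttype: "application/json";
  subject: string;
  data: T;
}

// Hypothetical builder: wraps a payload in a CloudEvents envelope.
// The caller supplies a unique id (for example, a UUID).
function toCloudEvent<T extends InteractionData>(
  type: string,
  source: string,
  id: string,
  data: T,
  now: Date = new Date()
): CloudEventEnvelope<T> {
  return {
    specversion: "1.0",
    type,
    source,
    id,
    time: now.toISOString(),
    datacontenttype: "application/json",
    // Subject pattern inferred from the sample wire format above.
    subject: `session/${data.session_id}/interaction/${data.interaction_id}`,
    data,
  };
}
```

Serializing the returned object with `JSON.stringify` gives a structured-mode JSON event of the same shape as the sample.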

Standard Event Types:

    Event Type                            | Trigger
    --------------------------------------|----------------------------
    io.ossa.agent.started                 | Agent instance initialized
    io.ossa.agent.session.created         | New conversation session
    io.ossa.agent.interaction.started     | Prompt received
    io.ossa.agent.interaction.completed   | Response generated
    io.ossa.agent.interaction.failed      | Error during generation
    io.ossa.agent.capability.invoked      | Tool/capability called
    io.ossa.agent.handoff.initiated       | Agent-to-agent delegation
    io.ossa.agent.stopped                 | Agent instance terminated
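A consumer that wants to reject unknown event types could encode the list above as a union type with a runtime guard. This is a sketch only. `STANDARD_EVENT_TYPES` and `isStandardEventType` are hypothetical names and are not exported by the package.

```typescript
// The standard event types listed above, as a readonly tuple.
const STANDARD_EVENT_TYPES = [
  "io.ossa.agent.started",
  "io.ossa.agent.session.created",
  "io.ossa.agent.interaction.started",
  "io.ossa.agent.interaction.completed",
  "io.ossa.agent.interaction.failed",
  "io.ossa.agent.capability.invoked",
  "io.ossa.agent.handoff.initiated",
  "io.ossa.agent.stopped",
] as const;

// Union of the string literals above.
type StandardEventType = (typeof STANDARD_EVENT_TYPES)[number];

// Runtime guard that also narrows the type for the compiler.
function isStandardEventType(value: string): value is StandardEventType {
  return (STANDARD_EVENT_TYPES as readonly string[]).indexOf(value) !== -1;
}
```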

Metrics Cardinality Controls

Critical for Prometheus: keep high-cardinality attributes out of metric labels to prevent label explosion.

    import { AgentTracer, CardinalityConfig } from '@bluefly/agent-tracer'

    const tracer = new AgentTracer({
      cardinality: {
        // Allowed labels (low cardinality)
        allowed_labels: [
          'agent_name',       // ~50 unique values
          'capability_name',  // ~100 unique values
          'status',           // success | error | timeout
          'model',            // ~10 unique values
          'environment'       // dev | staging | production
        ],

        // Forbidden labels (high cardinality - use traces instead)
        forbidden_labels: [
          'instance_id',      // Thousands of values
          'session_id',       // Millions of values
          'interaction_id',   // Billions of values
          'user_id',          // High cardinality
          'request_id'        // High cardinality
        ],

        // Transform high-cardinality to buckets
        bucket_transforms: {
          'latency_ms': [50, 100, 250, 500, 1000, 2500, 5000],
          'token_count': [100, 500, 1000, 5000, 10000, 50000]
        }
      }
    })

Rule: Use metrics for aggregates (agent_name, status), traces for specifics (instance_id, session_id).
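One way to picture the rule is a label filter that drops forbidden keys, buckets numeric values, and passes through only allowed keys. The sketch below reuses the configuration key names from above. `sanitizeLabels`, `bucketFor`, and the `le_<bound>` bucket naming are hypothetical. How the library actually applies these settings is not described in this document.

```typescript
// Mirrors the cardinality settings shown above.
interface CardinalityRules {
  allowed_labels: string[];
  forbidden_labels: string[];
  bucket_transforms: Record<string, number[]>;
}

// Map a number to the first upper bound it fits under (bounds sorted ascending).
function bucketFor(value: number, bounds: number[]): string {
  for (const bound of bounds) {
    if (value <= bound) return `le_${bound}`;
  }
  return "le_inf"; // larger than every configured bound
}

// Hypothetical filter: returns only labels that are safe to attach to a metric.
function sanitizeLabels(
  raw: Record<string, string | number>,
  rules: CardinalityRules
): Record<string, string> {
  const out: Record<string, string> = {};
  for (const key of Object.keys(raw)) {
    const value = raw[key];
    // Forbidden labels never reach the metric; they belong on traces.
    if (rules.forbidden_labels.indexOf(key) !== -1) continue;
    const bounds = rules.bucket_transforms[key];
    if (bounds !== undefined && typeof value === "number") {
      out[key] = bucketFor(value, bounds); // collapse to a bounded set of values
    } else if (rules.allowed_labels.indexOf(key) !== -1) {
      out[key] = String(value);
    }
    // Anything else is dropped: unknown labels are treated as unsafe.
  }
  return out;
}
```

Dropping unknown labels by default is a deliberate choice in this sketch: a new attribute cannot inflate the series count until someone adds it to the allow list.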


CLI Commands

Start Services

    # Start ACE server
    agent-tracer ace start
    # ACE server running on http://localhost:3008

    # Start ATLAS server
    agent-tracer atlas start
    # ATLAS server running on http://localhost:3009

    # Start main tracer
    agent-tracer start
    # Tracer running on http://localhost:3007

View Traces

    # List recent traces
    agent-tracer traces list --limit 10

    # Get specific trace
    agent-tracer traces get --trace-id abc123

    # Export traces
    agent-tracer traces export --output traces.json

    # Search traces
    agent-tracer traces search --service tdd-enforcer --status error

ACE Commands

    # Score agent performance
    agent-tracer ace score --agent-id tdd-enforcer-001

    # Benchmark multiple agents
    agent-tracer ace benchmark --agents tdd-enforcer,api-builder,doc-sync

    # View capabilities
    agent-tracer ace capabilities --agent-id tdd-enforcer-001

    # Generate performance report
    agent-tracer ace report --agent-id tdd-enforcer-001 --output report.html

ATLAS Commands

    # Analyze agent performance
    agent-tracer atlas analyze --agent-id tdd-enforcer-001

    # Optimize workflow
    agent-tracer atlas optimize --workflow-id tdd-workflow-v1

    # View historical trends
    agent-tracer atlas trends --agent-id tdd-enforcer-001 --days 30

Correlation Analysis

    # Find correlations for trace
    agent-tracer correlate --trace-id abc123

    # Root cause analysis
    agent-tracer rca --incident-id incident-789

    # Impact analysis
    agent-tracer impact --service tdd-enforcer

ACE (AI Capabilities Engine)

Performance Scoring

ACE scores agent performance across multiple dimensions:

    $ agent-tracer ace score --agent-id tdd-enforcer-001

    ACE Performance Score

    Agent: tdd-enforcer-001
    Period: Last 24 hours

    Overall Score: 88/100

    Component Scores:
      Quality: 92/100
        - Test coverage: 95%
        - Test quality: 90%
        - TDD compliance: 91%

      Efficiency: 85/100
        - Task completion: 95%
        - Average latency: 1.2s
        - Token usage: 85k (optimal)

      Reliability: 87/100
        - Success rate: 98%
        - Error rate: 2%
        - Uptime: 99.9%

    Recommendations:
      - Reduce token usage on simple tasks (-10%)
      - Improve error handling for edge cases
      - Cache common test patterns
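The sample figures are consistent with an unweighted mean of the three component scores: (92 + 85 + 87) / 3 = 88. Whether ACE actually weights the components equally is not stated here, so the sketch below is only one formula that reproduces the sample. `overallScore` and `ComponentScores` are hypothetical names.

```typescript
// Component scores on a 0-100 scale, as in the sample output.
interface ComponentScores {
  quality: number;
  efficiency: number;
  reliability: number;
}

// Assumption: equal weights. The real ACE weighting is not documented here.
function overallScore(scores: ComponentScores): number {
  const sum = scores.quality + scores.efficiency + scores.reliability;
  return Math.round(sum / 3);
}
```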

Capability Matrix

    $ agent-tracer ace capabilities --agent-id tdd-enforcer-001

    Agent Capabilities

    Capability              | Level | Confidence
    ------------------------|-------|------------
    Test Generation         | 95%   | High
    Coverage Analysis       | 92%   | High
    TDD Enforcement         | 88%   | Medium
    Code Quality Check      | 85%   | Medium
    Security Validation     | 78%   | Medium
    Performance Testing     | 65%   | Low

    Strengths:
      - Excellent at generating comprehensive test suites
      - High accuracy in coverage analysis
      - Strong TDD compliance enforcement

    Areas for Improvement:
      - Security test generation needs work
      - Performance test coverage is low

ATLAS (Agent Tracing & Learning Analytics)

Learning Analytics

    $ agent-tracer atlas analyze --agent-id tdd-enforcer-001

    ATLAS Learning Analytics

    Agent: tdd-enforcer-001
    Analysis Period: 30 days

    Learning Progress:
      Task Success Rate: 87% → 98% (+11%)
      Average Quality Score: 75 → 92 (+17)
      Token Efficiency: 120k → 85k (-35k)
      Latency: 2.1s → 1.2s (-0.9s)

    Key Learnings:
      - Improved test pattern recognition
      - Better context understanding
      - Optimized prompt engineering
      - Enhanced error recovery

    Optimization Opportunities:
      - Cache frequently used test templates
      - Pre-process common file patterns
      - Batch similar tasks for efficiency

Workflow Optimization

    $ agent-tracer atlas optimize --workflow-id tdd-workflow-v1

    Workflow Optimization Report

    Workflow: tdd-workflow-v1
    Agents: tdd-enforcer, api-builder, doc-sync

    Bottlenecks Identified:
      1. TDD Enforcer → API Builder handoff (2.3s avg)
      2. API Builder test validation (1.8s avg)
      3. Doc Sync git operations (1.2s avg)

    Recommendations:
      - Parallelize TDD enforcement and doc sync
      - Cache API Builder validation results
      - Use git worktrees for faster operations

    Estimated Improvement: -3.5s (-45%)

Dashboards

Pre-built Grafana Dashboards

Located in dashboard/:

  1. Agent Overview - High-level metrics for all agents
  2. LLM Performance - Model usage, costs, token tracking
  3. Trace Analysis - Distributed trace visualization
  4. ACE Scores - Agent capability scores over time
  5. ATLAS Analytics - Learning progress and optimization
  6. Infrastructure - System health and resource usage

Import Dashboards

    # Import all dashboards
    curl -X POST http://localhost:3000/api/dashboards/import \
      -H "Content-Type: application/json" \
      -d @dashboard/agent-overview.json

    # Or via CLI
    agent-tracer dashboards import --all

Alerting

Alert Rules

Pre-configured in infrastructure/prometheus/alerts.yml:

    groups:
      - name: agent_alerts
        rules:
          - alert: HighAgentErrorRate
            expr: rate(agent_errors_total[5m]) > 0.05
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High error rate for {{ $labels.agent_id }}"

          - alert: HighLLMCost
            expr: sum(llm_cost_usd) > 100
            for: 1d
            labels:
              severity: critical
            annotations:
              summary: "Daily LLM cost exceeded $100"

          - alert: LowAgentQualityScore
            expr: agent_quality_score < 0.7
            for: 1h
            labels:
              severity: warning
            annotations:
              summary: "Agent {{ $labels.agent_id }} quality score below threshold"
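In PromQL, `rate()` returns a per-second average increase, so the `0.05` threshold in `HighAgentErrorRate` means 0.05 errors per second (about 3 per minute) for each series, not a 5% error ratio. The sketch below shows the arithmetic on two counter samples. It is a simplification that ignores counter resets and the extrapolation Prometheus performs. `perSecondRate` and `CounterSample` are hypothetical names.

```typescript
// One scraped counter value with its timestamp in seconds.
interface CounterSample {
  timestampSec: number;
  value: number;
}

// Simplified per-second rate between the first and last sample in a window.
// Ignores counter resets and boundary extrapolation, unlike Prometheus rate().
function perSecondRate(first: CounterSample, last: CounterSample): number {
  const elapsed = last.timestampSec - first.timestampSec;
  if (elapsed <= 0) {
    throw new Error("last sample must be later than first sample");
  }
  return (last.value - first.value) / elapsed;
}

// 18 new errors over a 5-minute window is 0.06/s, which is above the threshold.
const exampleRate = perSecondRate(
  { timestampSec: 0, value: 100 },
  { timestampSec: 300, value: 118 }
);
const exceedsThreshold = exampleRate > 0.05;
```

An alert on the share of failed tasks would instead divide the error rate by a total-task rate; that needs a total counter, which this document does not name.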

Configure Alertmanager

    # alertmanager.yml
    global:
      slack_api_url: 'https://hooks.slack.com/services/...'

    route:
      group_by: ['alertname', 'agent_id']
      receiver: 'slack-notifications'

    receivers:
      - name: 'slack-notifications'
        slack_configs:
          - channel: '#agent-alerts'
            title: 'Agent Alert: {{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

Integration with BuildKit

Automatic Instrumentation

    # BuildKit automatically instruments agents
    buildkit agents deploy tdd-enforcer --with-tracing

    # View agent traces
    buildkit agents traces tdd-enforcer-001

    # View agent metrics
    buildkit agents metrics tdd-enforcer-001

    # ACE score
    buildkit agents score tdd-enforcer-001

API Reference

Full API documentation: OpenAPI Spec

Key Endpoints

Tracing

  • POST /api/v1/traces - Submit trace
  • GET /api/v1/traces/:id - Get trace
  • GET /api/v1/traces/search - Search traces
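As a sketch of calling the search endpoint, the helper below builds the request URL. The query parameter names are an assumption: `service` and `status` mirror the `agent-tracer traces search` CLI flags, and the OpenAPI spec should be treated as the authority. `traceSearchUrl` is a hypothetical helper, and the base URL assumes the tracer's default port 3007.

```typescript
// Hypothetical helper: build a URL for GET /api/v1/traces/search.
// Filter names are assumed to match the CLI flags (service, status).
function traceSearchUrl(baseUrl: string, filters: Record<string, string>): string {
  const root = baseUrl.replace(/\/+$/, ""); // drop any trailing slashes
  const query = Object.keys(filters)
    .sort() // stable ordering keeps URLs predictable
    .map((key) => `${encodeURIComponent(key)}=${encodeURIComponent(filters[key] ?? "")}`)
    .join("&");
  const path = `${root}/api/v1/traces/search`;
  return query ? `${path}?${query}` : path;
}
```

The resulting string can be passed to `fetch` or any HTTP client. The response shape is not described in this document, so it is left out of the sketch.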

ACE

  • POST /api/v1/ace/score - Score agent
  • GET /api/v1/ace/benchmarks - List benchmarks
  • POST /api/v1/ace/capabilities - Get capabilities

ATLAS

  • GET /api/v1/atlas/analytics/:agentId - Get analytics
  • POST /api/v1/atlas/optimize - Optimize workflow
  • GET /api/v1/atlas/trends - Historical trends

Metrics

  • GET /metrics - Prometheus metrics
  • POST /api/v1/metrics/custom - Submit custom metric