Agent Tracer - AI Operations Intelligence
Agent Tracer is the unified observability platform for AI agents and LLM workflows across the LLM platform. It provides comprehensive tracing, metrics, and analytics for autonomous agent operations.
Overview
Agent Tracer combines AI-specific observability (Phoenix Arize), distributed tracing (Jaeger/Tempo), correlation analysis (Neo4j), and agent performance scoring and analytics (ACE, ATLAS) in a single platform.
Key Capabilities
- AI-Specific Tracing: LLM call tracking, token usage, prompt analysis
- Distributed Tracing: OpenTelemetry-based cross-service tracing
- Correlation Engine: Neo4j-powered relationship discovery
- ACE: AI Capabilities Engine for performance scoring
- ATLAS: Agent Tracing and Learning Analytics System
- Metrics: Prometheus/Grafana integration
- Alerting: Proactive issue detection
Architecture
    graph TB
        subgraph "AI Agents"
            A1[TDD Enforcer]
            A2[API Builder]
            A3[Doc Sync]
            A4[Security Audit]
        end
        subgraph "Agent Tracer Core"
            AT[Agent Tracer]
            ACE[ACE Engine]
            ATLAS[ATLAS Analytics]
        end
        subgraph "Observability Stack"
            PHX[Phoenix Arize]
            JAE[Jaeger]
            NEO[Neo4j]
            PROM[Prometheus]
        end
        subgraph "Visualization"
            GRA[Grafana]
            UI[Phoenix UI]
        end
        A1 --> AT
        A2 --> AT
        A3 --> AT
        A4 --> AT
        AT --> PHX
        AT --> JAE
        AT --> NEO
        AT --> PROM
        ACE --> AT
        ATLAS --> AT
        PHX --> UI
        PROM --> GRA
        JAE --> GRA
Installation
NPM Package
    # Install agent-tracer
    npm install @bluefly/agent-tracer

    # Or globally
    npm install -g @bluefly/agent-tracer

    # Verify installation
    agent-tracer --version
Docker Stack
    # Clone repository
    git clone https://gitlab.com/blueflyio/agent-platform/agent-tracer.git
    cd agent-tracer

    # Start observability stack
    docker-compose up -d

    # Verify services
    docker-compose ps

    # Services started:
    # - Phoenix Arize: http://localhost:6006
    # - Jaeger UI: http://localhost:16686
    # - Neo4j Browser: http://localhost:7474
    # - Prometheus: http://localhost:9090
    # - Grafana: http://localhost:3000
    # - ACE Server: http://localhost:3008
    # - ATLAS Server: http://localhost:3009
Kubernetes Deployment
    # Deploy via Helm
    helm install agent-tracer ./infrastructure/helm/agent-tracer \
      --namespace observability \
      --create-namespace

    # Verify deployment
    kubectl get pods -n observability

    # Access services
    kubectl port-forward -n observability svc/phoenix 6006:6006
    kubectl port-forward -n observability svc/jaeger 16686:16686
Configuration
Environment Variables
    # Service Ports
    AGENT_TRACER_PORT=3007
    AGENT_TRACER_ACE_PORT=3008
    AGENT_TRACER_ATLAS_PORT=3009

    # Phoenix Arize
    PHOENIX_COLLECTOR_ENDPOINT=http://localhost:6006
    PHOENIX_PROJECT=llm-agents

    # Jaeger
    JAEGER_AGENT_HOST=localhost
    JAEGER_AGENT_PORT=6831
    JAEGER_COLLECTOR_ENDPOINT=http://localhost:14268/api/traces

    # Tempo
    TEMPO_ENDPOINT=http://localhost:4317

    # Prometheus
    PROMETHEUS_PUSHGATEWAY=http://localhost:9091

    # Neo4j (Correlation)
    NEO4J_URI=bolt://localhost:7687
    NEO4J_USER=neo4j
    NEO4J_PASSWORD=password

    # Qdrant (Vector Storage)
    QDRANT_URL=http://localhost:6333
    QDRANT_API_KEY=your-api-key

    # Loki (Logging)
    LOKI_URL=http://localhost:3100

    # Alertmanager
    ALERTMANAGER_URL=http://localhost:9093
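As an illustration of how an agent process might consume these variables, the sketch below reads a few of them and falls back to the local defaults listed above. `TracerEnv`, `readPort`, and `loadTracerEnv` are hypothetical names, not part of the `@bluefly/agent-tracer` package; the environment is passed in as a plain object so the helper can be exercised without touching `process.env`.

```typescript
// Hypothetical helper, not part of @bluefly/agent-tracer.
interface TracerEnv {
  tracerPort: number
  acePort: number
  atlasPort: number
  phoenixEndpoint: string
  phoenixProject: string
  neo4jUri: string
}

// Parse a port value, using the fallback when the variable is unset or empty.
function readPort(value: string | undefined, fallback: number): number {
  if (value === undefined || value === '') return fallback
  const port = Number(value)
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid port: ${value}`)
  }
  return port
}

// Defaults mirror the values shown in the environment variable listing.
function loadTracerEnv(env: Record<string, string | undefined>): TracerEnv {
  return {
    tracerPort: readPort(env.AGENT_TRACER_PORT, 3007),
    acePort: readPort(env.AGENT_TRACER_ACE_PORT, 3008),
    atlasPort: readPort(env.AGENT_TRACER_ATLAS_PORT, 3009),
    phoenixEndpoint: env.PHOENIX_COLLECTOR_ENDPOINT ?? 'http://localhost:6006',
    phoenixProject: env.PHOENIX_PROJECT ?? 'llm-agents',
    neo4jUri: env.NEO4J_URI ?? 'bolt://localhost:7687',
  }
}

const config = loadTracerEnv({ AGENT_TRACER_PORT: '4007' })
console.log(config.tracerPort, config.acePort)
```

In a Node.js process the same helper could be called as `loadTracerEnv(process.env)`. Failing fast on a malformed port keeps a typo from surfacing later as a confusing connection error.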
Tracer Configuration
config/tracer.yaml:
    tracer:
      serviceName: agent-tracer
      enabled: true
      samplingRate: 1.0

    exporters:
      phoenix:
        enabled: true
        endpoint: http://localhost:6006
      jaeger:
        enabled: true
        endpoint: http://localhost:14268/api/traces
      tempo:
        enabled: true
        endpoint: http://localhost:4317

    metrics:
      enabled: true
      prometheus:
        port: 9090
        path: /metrics

    logging:
      level: info
      format: json
      loki:
        enabled: true
        endpoint: http://localhost:3100
Usage
Initialize Tracing in Agent
    import { AgentTracer } from '@bluefly/agent-tracer'
    // SpanStatusCode is assumed to come from the OpenTelemetry API package
    import { SpanStatusCode } from '@opentelemetry/api'

    // Initialize tracer
    const tracer = new AgentTracer({
      serviceName: 'tdd-enforcer',
      phoenixEndpoint: 'http://localhost:6006',
      jaegerEndpoint: 'http://localhost:14268/api/traces',
      neo4jUri: 'bolt://localhost:7687'
    })

    // Wrap the task in an async function so `await` and `return` are valid
    async function runEnforceTDD() {
      // Start span for agent task
      const span = tracer.startSpan('enforce-tdd', {
        taskId: 'task-123',
        agentId: 'tdd-enforcer-001',
        projectId: 'llm-platform'
      })

      try {
        // Execute agent task
        const result = await enforceTDD()
        span.setAttribute('tests.created', result.testsCreated)
        span.setAttribute('coverage.percentage', result.coverage)
        span.setStatus({ code: SpanStatusCode.OK })
        return result
      } catch (error) {
        // Record the failure on the span, then rethrow to the caller
        span.recordException(error)
        span.setStatus({ code: SpanStatusCode.ERROR })
        throw error
      } finally {
        // Always close the span, on success or failure
        span.end()
      }
    }
Trace LLM Calls
    import { PhoenixTracer } from '@bluefly/agent-tracer/integrations/phoenix'

    const phoenix = new PhoenixTracer({
      endpoint: 'http://localhost:6006',
      project: 'llm-agents'
    })

    // Trace Claude API call
    const result = await phoenix.traceLLMCall({
      model: 'claude-sonnet-4-5-20250929',
      prompt: 'Generate tests for AuthService.ts',
      provider: 'anthropic',
      metadata: {
        agentId: 'tdd-enforcer-001',
        taskType: 'test-generation'
      }
    })

    console.log('Tokens:', result.usage.totalTokens)
    console.log('Cost:', result.cost)
    console.log('Latency:', result.latency)
Record Metrics
    import { metrics } from '@bluefly/agent-tracer'

    // Counter
    metrics.counter('agent.tasks.completed', {
      agentId: 'api-builder',
      status: 'success'
    })

    // Gauge
    metrics.gauge('agent.queue.size', 42, {
      agentId: 'doc-sync'
    })

    // Histogram
    metrics.histogram('agent.task.duration', 1250, {
      agentId: 'security-audit',
      taskType: 'vulnerability-scan'
    })
CloudEvents for Activity Streams
OSSA activity streams follow the CloudEvents specification for interoperability with event-driven systems (SigNoz, Kafka, EventBridge).
    import { CloudEventEmitter } from '@bluefly/agent-tracer/cloudevents'

    const emitter = new CloudEventEmitter({
      source: '/agents/review-agent/instance-123',
      type_prefix: 'io.ossa.agent'
    })

    // Emit interaction completed event
    await emitter.emit({
      type: 'io.ossa.agent.interaction.completed',
      data: {
        agent_id: 'review-agent',
        instance_id: 'uuid-instance',
        session_id: 'uuid-session',
        interaction_id: 'uuid-interaction',
        model: 'claude-sonnet-4-20250514',
        input_tokens: 1523,
        output_tokens: 892,
        latency_ms: 1250,
        finish_reason: 'stop',
        capabilities_used: ['code_review', 'security_analysis']
      }
    })
CloudEvents Wire Format:
    {
      "specversion": "1.0",
      "type": "io.ossa.agent.interaction.completed",
      "source": "/agents/review-agent/instance-123",
      "id": "550e8400-e29b-41d4-a716-446655440000",
      "time": "2025-12-04T10:30:00.000Z",
      "datacontenttype": "application/json",
      "subject": "session/uuid-session/interaction/uuid-interaction",
      "data": {
        "agent_id": "review-agent",
        "instance_id": "uuid-instance",
        "session_id": "uuid-session",
        "interaction_id": "uuid-interaction",
        "model": "claude-sonnet-4-20250514",
        "input_tokens": 1523,
        "output_tokens": 892,
        "latency_ms": 1250,
        "finish_reason": "stop",
        "capabilities_used": ["code_review", "security_analysis"]
      }
    }
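To make the mapping from event payload to wire format concrete, here is a minimal sketch of how such an envelope could be assembled. `toCloudEvent` is a hypothetical function, not the emitter's actual implementation, and the `subject` pattern (`session/<session_id>/interaction/<interaction_id>`) is inferred from the sample above rather than from a documented rule. The `id` and timestamp are passed in so the output is deterministic.

```typescript
// CloudEvents 1.0 envelope with the attributes used in the sample above.
interface CloudEvent<T> {
  specversion: '1.0'
  type: string
  source: string
  id: string
  time: string
  datacontenttype: 'application/json'
  subject: string
  data: T
}

// Hypothetical builder; the subject format is an assumption based on the sample.
function toCloudEvent<T extends { session_id: string; interaction_id: string }>(
  type: string,
  source: string,
  data: T,
  id: string,
  now: Date
): CloudEvent<T> {
  return {
    specversion: '1.0',
    type,
    source,
    id,
    time: now.toISOString(),
    datacontenttype: 'application/json',
    subject: `session/${data.session_id}/interaction/${data.interaction_id}`,
    data,
  }
}

const event = toCloudEvent(
  'io.ossa.agent.interaction.completed',
  '/agents/review-agent/instance-123',
  { session_id: 'uuid-session', interaction_id: 'uuid-interaction', latency_ms: 1250 },
  '550e8400-e29b-41d4-a716-446655440000',
  new Date('2025-12-04T10:30:00.000Z')
)
console.log(event.subject) // session/uuid-session/interaction/uuid-interaction
```

Keeping identifiers such as `session_id` in `subject` and `data`, rather than in metric labels, lets downstream consumers filter events without parsing the payload.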
Standard Event Types:
| Event Type | Trigger |
|---|---|
| io.ossa.agent.started | Agent instance initialized |
| io.ossa.agent.session.created | New conversation session |
| io.ossa.agent.interaction.started | Prompt received |
| io.ossa.agent.interaction.completed | Response generated |
| io.ossa.agent.interaction.failed | Error during generation |
| io.ossa.agent.capability.invoked | Tool/capability called |
| io.ossa.agent.handoff.initiated | Agent-to-agent delegation |
| io.ossa.agent.stopped | Agent instance terminated |
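A consumer of these events may want to reject unknown types before processing them. The sketch below encodes the eight event types from the table as a TypeScript union with a type guard; `OSSA_EVENT_TYPES`, `OssaEventType`, and `isOssaEventType` are illustrative names, not exports of the package.

```typescript
// The standard event types from the table, as a readonly tuple.
const OSSA_EVENT_TYPES = [
  'io.ossa.agent.started',
  'io.ossa.agent.session.created',
  'io.ossa.agent.interaction.started',
  'io.ossa.agent.interaction.completed',
  'io.ossa.agent.interaction.failed',
  'io.ossa.agent.capability.invoked',
  'io.ossa.agent.handoff.initiated',
  'io.ossa.agent.stopped',
] as const

// Union of the literal strings above.
type OssaEventType = (typeof OSSA_EVENT_TYPES)[number]

// Type guard: narrows an arbitrary string to a known event type.
function isOssaEventType(value: string): value is OssaEventType {
  return (OSSA_EVENT_TYPES as readonly string[]).indexOf(value) !== -1
}

console.log(isOssaEventType('io.ossa.agent.interaction.completed')) // true
console.log(isOssaEventType('io.ossa.agent.unknown')) // false
```

Deriving the union from a single array keeps the runtime check and the compile-time type in sync when a new event type is added.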
Metrics Cardinality Controls
Critical for Prometheus: prevent label explosion by keeping high-cardinality attributes out of metric labels.
    import { AgentTracer, CardinalityConfig } from '@bluefly/agent-tracer'

    const tracer = new AgentTracer({
      cardinality: {
        // Allowed labels (low cardinality)
        allowed_labels: [
          'agent_name',       // ~50 unique values
          'capability_name',  // ~100 unique values
          'status',           // success | error | timeout
          'model',            // ~10 unique values
          'environment'       // dev | staging | production
        ],

        // Forbidden labels (high cardinality - use traces instead)
        forbidden_labels: [
          'instance_id',      // Thousands of values
          'session_id',       // Millions of values
          'interaction_id',   // Billions of values
          'user_id',          // High cardinality
          'request_id'        // High cardinality
        ],

        // Transform high-cardinality to buckets
        bucket_transforms: {
          'latency_ms': [50, 100, 250, 500, 1000, 2500, 5000],
          'token_count': [100, 500, 1000, 5000, 10000, 50000]
        }
      }
    })
Rule: Use metrics for aggregates (agent_name, status), traces for specifics (instance_id, session_id).
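The effect of these rules on a single set of labels can be sketched as follows. This is an illustration of the idea only, not the library's implementation: `applyCardinalityRules` and `toBucket` are hypothetical helpers, the `le_<bound>` bucket naming is an assumption, and bucket bounds are assumed to be sorted in ascending order.

```typescript
type Labels = Record<string, string | number>

interface CardinalityRules {
  allowedLabels: string[]
  bucketTransforms: Record<string, number[]>
}

// Map a numeric value to the first bucket bound that contains it.
// Bounds must be ascending; values above the last bound go to 'le_inf'.
function toBucket(value: number, bounds: number[]): string {
  for (const bound of bounds) {
    if (value <= bound) return `le_${bound}`
  }
  return 'le_inf'
}

// Keep allowed labels, bucket numeric labels that have a transform, drop the rest.
function applyCardinalityRules(labels: Labels, rules: CardinalityRules): Record<string, string> {
  const out: Record<string, string> = {}
  for (const name of Object.keys(labels)) {
    const value = labels[name]
    const bounds = rules.bucketTransforms[name]
    if (bounds !== undefined && typeof value === 'number') {
      out[name] = toBucket(value, bounds)
    } else if (rules.allowedLabels.indexOf(name) !== -1) {
      out[name] = String(value)
    }
    // Anything else (for example session_id) is dropped and belongs in traces.
  }
  return out
}

const safe = applyCardinalityRules(
  { agent_name: 'tdd-enforcer', session_id: 'uuid-session', latency_ms: 1250 },
  {
    allowedLabels: ['agent_name', 'status', 'model'],
    bucketTransforms: { latency_ms: [50, 100, 250, 500, 1000, 2500, 5000] },
  }
)
console.log(safe) // { agent_name: 'tdd-enforcer', latency_ms: 'le_2500' }
```

With seven bounds plus an overflow bucket, `latency_ms` contributes at most eight label values regardless of how many distinct latencies are observed.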
CLI Commands
Start Services
    # Start ACE server
    agent-tracer ace start
    # ACE server running on http://localhost:3008

    # Start ATLAS server
    agent-tracer atlas start
    # ATLAS server running on http://localhost:3009

    # Start main tracer
    agent-tracer start
    # Tracer running on http://localhost:3007
View Traces
    # List recent traces
    agent-tracer traces list --limit 10

    # Get specific trace
    agent-tracer traces get --trace-id abc123

    # Export traces
    agent-tracer traces export --output traces.json

    # Search traces
    agent-tracer traces search --service tdd-enforcer --status error
ACE Commands
    # Score agent performance
    agent-tracer ace score --agent-id tdd-enforcer-001

    # Benchmark multiple agents
    agent-tracer ace benchmark --agents tdd-enforcer,api-builder,doc-sync

    # View capabilities
    agent-tracer ace capabilities --agent-id tdd-enforcer-001

    # Generate performance report
    agent-tracer ace report --agent-id tdd-enforcer-001 --output report.html
ATLAS Commands
    # Analyze agent performance
    agent-tracer atlas analyze --agent-id tdd-enforcer-001

    # Optimize workflow
    agent-tracer atlas optimize --workflow-id tdd-workflow-v1

    # View historical trends
    agent-tracer atlas trends --agent-id tdd-enforcer-001 --days 30
Correlation Analysis
    # Find correlations for trace
    agent-tracer correlate --trace-id abc123

    # Root cause analysis
    agent-tracer rca --incident-id incident-789

    # Impact analysis
    agent-tracer impact --service tdd-enforcer
ACE (AI Capabilities Engine)
Performance Scoring
ACE scores agent performance across multiple dimensions:
    $ agent-tracer ace score --agent-id tdd-enforcer-001

    ACE Performance Score
    Agent: tdd-enforcer-001
    Period: Last 24 hours

    Overall Score: 88/100

    Component Scores:
      Quality: 92/100
        - Test coverage: 95%
        - Test quality: 90%
        - TDD compliance: 91%
      Efficiency: 85/100
        - Task completion: 95%
        - Average latency: 1.2s
        - Token usage: 85k (optimal)
      Reliability: 87/100
        - Success rate: 98%
        - Error rate: 2%
        - Uptime: 99.9%

    Recommendations:
      - Reduce token usage on simple tasks (-10%)
      - Improve error handling for edge cases
      - Cache common test patterns
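The sample output is consistent with the overall score being the unweighted mean of the three component scores: (92 + 85 + 87) / 3 = 88. ACE's actual weighting is not documented here, so the sketch below is an assumption that happens to reproduce this example, with `overallScore` as a hypothetical helper.

```typescript
interface ComponentScores {
  quality: number
  efficiency: number
  reliability: number
}

// Assumption: overall score = rounded unweighted mean of the component scores.
function overallScore(scores: ComponentScores): number {
  const values = [scores.quality, scores.efficiency, scores.reliability]
  const sum = values.reduce((total, value) => total + value, 0)
  return Math.round(sum / values.length)
}

console.log(overallScore({ quality: 92, efficiency: 85, reliability: 87 })) // 88
```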
Capability Matrix
    $ agent-tracer ace capabilities --agent-id tdd-enforcer-001

    Agent Capabilities

    Capability              | Level | Confidence
    ------------------------|-------|------------
    Test Generation         | 95%   | High
    Coverage Analysis       | 92%   | High
    TDD Enforcement         | 88%   | Medium
    Code Quality Check      | 85%   | Medium
    Security Validation     | 78%   | Medium
    Performance Testing     | 65%   | Low

    Strengths:
      - Excellent at generating comprehensive test suites
      - High accuracy in coverage analysis
      - Strong TDD compliance enforcement

    Areas for Improvement:
      - Security test generation needs work
      - Performance test coverage is low
ATLAS (Agent Tracing & Learning Analytics)
Learning Analytics
    $ agent-tracer atlas analyze --agent-id tdd-enforcer-001

    ATLAS Learning Analytics
    Agent: tdd-enforcer-001
    Analysis Period: 30 days

    Learning Progress:
      Task Success Rate: 87% → 98% (+11%)
      Average Quality Score: 75 → 92 (+17)
      Token Efficiency: 120k → 85k (-35k)
      Latency: 2.1s → 1.2s (-0.9s)

    Key Learnings:
      - Improved test pattern recognition
      - Better context understanding
      - Optimized prompt engineering
      - Enhanced error recovery

    Optimization Opportunities:
      - Cache frequently used test templates
      - Pre-process common file patterns
      - Batch similar tasks for efficiency
Workflow Optimization
    $ agent-tracer atlas optimize --workflow-id tdd-workflow-v1

    Workflow Optimization Report
    Workflow: tdd-workflow-v1
    Agents: tdd-enforcer, api-builder, doc-sync

    Bottlenecks Identified:
      1. TDD Enforcer → API Builder handoff (2.3s avg)
      2. API Builder test validation (1.8s avg)
      3. Doc Sync git operations (1.2s avg)

    Recommendations:
      - Parallelize TDD enforcement and doc sync
      - Cache API Builder validation results
      - Use git worktrees for faster operations

    Estimated Improvement: -3.5s (-45%)
Dashboards
Pre-built Grafana Dashboards
Located in dashboard/:
- Agent Overview - High-level metrics for all agents
- LLM Performance - Model usage, costs, token tracking
- Trace Analysis - Distributed trace visualization
- ACE Scores - Agent capability scores over time
- ATLAS Analytics - Learning progress and optimization
- Infrastructure - System health and resource usage
Import Dashboards
    # Import a single dashboard via the Grafana HTTP API
    curl -X POST http://localhost:3000/api/dashboards/import \
      -H "Content-Type: application/json" \
      -d @dashboard/agent-overview.json

    # Or import all dashboards via the CLI
    agent-tracer dashboards import --all
Alerting
Alert Rules
Pre-configured in infrastructure/prometheus/alerts.yml:
    groups:
      - name: agent_alerts
        rules:
          - alert: HighAgentErrorRate
            expr: rate(agent_errors_total[5m]) > 0.05
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High error rate for {{ $labels.agent_id }}"

          - alert: HighLLMCost
            expr: sum(llm_cost_usd) > 100
            for: 1d
            labels:
              severity: critical
            annotations:
              summary: "Daily LLM cost exceeded $100"

          - alert: LowAgentQualityScore
            expr: agent_quality_score < 0.7
            for: 1h
            labels:
              severity: warning
            annotations:
              summary: "Agent {{ $labels.agent_id }} quality score below threshold"
Configure Alertmanager
    # alertmanager.yml
    global:
      slack_api_url: 'https://hooks.slack.com/services/...'

    route:
      group_by: ['alertname', 'agent_id']
      receiver: 'slack-notifications'

    receivers:
      - name: 'slack-notifications'
        slack_configs:
          - channel: '#agent-alerts'
            title: 'Agent Alert: {{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
Integration with BuildKit
Automatic Instrumentation
    # BuildKit automatically instruments agents
    buildkit agents deploy tdd-enforcer --with-tracing

    # View agent traces
    buildkit agents traces tdd-enforcer-001

    # View agent metrics
    buildkit agents metrics tdd-enforcer-001

    # ACE score
    buildkit agents score tdd-enforcer-001
API Reference
Full API documentation: OpenAPI Spec
Key Endpoints
Tracing
- POST /api/v1/traces - Submit trace
- GET /api/v1/traces/:id - Get trace
- GET /api/v1/traces/search - Search traces
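As a rough sketch of calling the trace lookup endpoint from TypeScript, the helper below builds the request URL and fetches a trace by ID. It assumes the API is served from the main tracer port (`http://localhost:3007`), which this page does not state explicitly, and it treats the response body as `unknown` because the schema is defined in the OpenAPI spec rather than here. `traceUrl` and `getTrace` are hypothetical names; the fetch function is injected so the sketch does not depend on a particular runtime.

```typescript
// Minimal shape of the fetch-like function this sketch needs.
type FetchLike = (url: string) => Promise<{
  ok: boolean
  status: number
  json(): Promise<unknown>
}>

// Build the URL for GET /api/v1/traces/:id, tolerating a trailing slash on the base.
function traceUrl(baseUrl: string, traceId: string): string {
  const base = baseUrl.replace(/\/+$/, '')
  return `${base}/api/v1/traces/${encodeURIComponent(traceId)}`
}

// Fetch a single trace; throws on any non-2xx response.
async function getTrace(fetchFn: FetchLike, baseUrl: string, traceId: string): Promise<unknown> {
  const response = await fetchFn(traceUrl(baseUrl, traceId))
  if (!response.ok) {
    throw new Error(`Trace request failed with status ${response.status}`)
  }
  return response.json()
}

console.log(traceUrl('http://localhost:3007/', 'abc123'))
// http://localhost:3007/api/v1/traces/abc123
```

In Node.js 18 or later, the global `fetch` can be passed as `fetchFn`, for example `getTrace(fetch, 'http://localhost:3007', 'abc123')`.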
ACE
- POST /api/v1/ace/score - Score agent
- GET /api/v1/ace/benchmarks - List benchmarks
- POST /api/v1/ace/capabilities - Get capabilities
ATLAS
- GET /api/v1/atlas/analytics/:agentId - Get analytics
- POST /api/v1/atlas/optimize - Optimize workflow
- GET /api/v1/atlas/trends - Historical trends
Metrics
- GET /metrics - Prometheus metrics
- POST /api/v1/metrics/custom - Submit custom metric