Agent Tracer - AI Operations Intelligence

Agent Tracer is the unified observability platform for AI agents and LLM workflows across the LLM platform. It provides tracing, metrics, and analytics for autonomous agent operations.

Overview

Agent Tracer combines AI-specific observability (Phoenix Arize), distributed tracing (Jaeger/Tempo), correlation analysis (Neo4j), and intelligent performance tracking into a single platform.

Key Capabilities

AI-Specific Tracing: LLM call tracking, token usage, prompt analysis
Distributed Tracing: OpenTelemetry-based cross-service tracing
Correlation Engine: Neo4j-powered relationship discovery
ACE: AI Capabilities Engine for performance scoring
ATLAS: Agent Tracing and Learning Analytics System
Metrics: Prometheus/Grafana integration
Alerting: Proactive issue detection

Architecture

graph TB
    subgraph "AI Agents"
        A1[TDD Enforcer]
        A2[API Builder]
        A3[Doc Sync]
        A4[Security Audit]
    end

    subgraph "Agent Tracer Core"
        AT[Agent Tracer]
        ACE[ACE Engine]
        ATLAS[ATLAS Analytics]
    end

    subgraph "Observability Stack"
        PHX[Phoenix Arize]
        JAE[Jaeger]
        NEO[Neo4j]
        PROM[Prometheus]
    end

    subgraph "Visualization"
        GRA[Grafana]
        UI[Phoenix UI]
    end

    A1 --> AT
    A2 --> AT
    A3 --> AT
    A4 --> AT

    AT --> PHX
    AT --> JAE
    AT --> NEO
    AT --> PROM

    ACE --> AT
    ATLAS --> AT

    PHX --> UI
    PROM --> GRA
    JAE --> GRA

Installation

NPM Package

# Install agent-tracer
npm install @bluefly/agent-tracer

# Or globally
npm install -g @bluefly/agent-tracer

# Verify installation
agent-tracer --version

Docker Stack

# Clone repository
git clone https://gitlab.com/blueflyio/agent-platform/agent-tracer.git
cd agent-tracer

# Start observability stack
docker-compose up -d

# Verify services
docker-compose ps

# Services started:
#   - Phoenix Arize: http://localhost:6006
#   - Jaeger UI: http://localhost:16686
#   - Neo4j Browser: http://localhost:7474
#   - Prometheus: http://localhost:9090
#   - Grafana: http://localhost:3000
#   - ACE Server: http://localhost:3008
#   - ATLAS Server: http://localhost:3009

Kubernetes Deployment

# Deploy via Helm
helm install agent-tracer ./infrastructure/helm/agent-tracer \
  --namespace observability \
  --create-namespace

# Verify deployment
kubectl get pods -n observability

# Access services
kubectl port-forward -n observability svc/phoenix 6006:6006
kubectl port-forward -n observability svc/jaeger 16686:16686

Configuration

Environment Variables

# Service Ports
AGENT_TRACER_PORT=3007
AGENT_TRACER_ACE_PORT=3008
AGENT_TRACER_ATLAS_PORT=3009

# Phoenix Arize
PHOENIX_COLLECTOR_ENDPOINT=http://localhost:6006
PHOENIX_PROJECT=llm-agents

# Jaeger
JAEGER_AGENT_HOST=localhost
JAEGER_AGENT_PORT=6831
JAEGER_COLLECTOR_ENDPOINT=http://localhost:14268/api/traces

# Tempo
TEMPO_ENDPOINT=http://localhost:4317

# Prometheus
PROMETHEUS_PUSHGATEWAY=http://localhost:9091

# Neo4j (Correlation)
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=password

# Qdrant (Vector Storage)
QDRANT_URL=http://localhost:6333
QDRANT_API_KEY=your-api-key

# Loki (Logging)
LOKI_URL=http://localhost:3100

# Alertmanager
ALERTMANAGER_URL=http://localhost:9093
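As a rough illustration of how these variables might be consumed, the sketch below reads a few of them from a plain key/value map and falls back to the values listed above. The `TracerEnvConfig` shape, `loadTracerEnv`, and `readPort` are hypothetical names for this example, not part of `@bluefly/agent-tracer`, and treating the listed values as defaults is an assumption.

```typescript
// Hypothetical config shape; field names are illustrative only.
interface TracerEnvConfig {
  port: number;
  acePort: number;
  atlasPort: number;
  phoenixEndpoint: string;
  neo4jUri: string;
}

type Env = Record<string, string | undefined>;

// Parse a port variable. Falls back when the variable is unset or empty,
// and throws when it is set to something that is not a valid port.
function readPort(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return fallback;
  const port = Number(raw);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`${key} must be an integer between 1 and 65535, got "${raw}"`);
  }
  return port;
}

// Build the config from an environment map (pass process.env in Node).
function loadTracerEnv(env: Env): TracerEnvConfig {
  return {
    port: readPort(env, "AGENT_TRACER_PORT", 3007),
    acePort: readPort(env, "AGENT_TRACER_ACE_PORT", 3008),
    atlasPort: readPort(env, "AGENT_TRACER_ATLAS_PORT", 3009),
    phoenixEndpoint: env["PHOENIX_COLLECTOR_ENDPOINT"] ?? "http://localhost:6006",
    neo4jUri: env["NEO4J_URI"] ?? "bolt://localhost:7687",
  };
}
```

Taking the environment as a parameter rather than reading a global keeps the loader easy to test and makes invalid ports fail at startup instead of at first use.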

Tracer Configuration

config/tracer.yaml:

tracer:
  serviceName: agent-tracer
  enabled: true
  samplingRate: 1.0

exporters:
  phoenix:
    enabled: true
    endpoint: http://localhost:6006
  jaeger:
    enabled: true
    endpoint: http://localhost:14268/api/traces
  tempo:
    enabled: true
    endpoint: http://localhost:4317

metrics:
  enabled: true
  prometheus:
    port: 9090
    path: /metrics

logging:
  level: info
  format: json
  loki:
    enabled: true
    endpoint: http://localhost:3100
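To make `samplingRate` concrete: a rate of 1.0 keeps every trace, and lower values keep a fraction. The sketch below shows one common way to make that decision deterministically from the trace ID, so that every service seeing the same trace agrees. It is an illustration only; `shouldSample` is a hypothetical helper, and the document does not say how Agent Tracer actually samples.

```typescript
// Decide whether to keep a trace for a sampling rate in [0, 1].
// The last 8 hex digits of the trace ID are read as a 32-bit number and
// compared against rate * 2^32, so the decision depends only on the ID.
function shouldSample(traceId: string, samplingRate: number): boolean {
  if (samplingRate < 0 || samplingRate > 1) {
    throw new RangeError("samplingRate must be between 0 and 1");
  }
  const tail = traceId.slice(-8);
  if (!/^[0-9a-fA-F]{8}$/.test(tail)) {
    throw new Error("traceId must end with at least 8 hexadecimal digits");
  }
  const bucket = parseInt(tail, 16); // 0 .. 2^32 - 1
  return bucket < samplingRate * 0x100000000;
}
```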

Usage

Initialize Tracing in Agent

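import { AgentTracer } from '@bluefly/agent-tracer'
// SpanStatusCode is exported by the OpenTelemetry API package;
// this assumes Agent Tracer spans are OpenTelemetry-compatible
import { SpanStatusCode } from '@opentelemetry/api'

// Initialize tracer
const tracer = new AgentTracer({
  serviceName: 'tdd-enforcer',
  phoenixEndpoint: 'http://localhost:6006',
  jaegerEndpoint: 'http://localhost:14268/api/traces',
  neo4jUri: 'bolt://localhost:7687'
})

// Wrap the task in an async function so that await and return are valid
async function runEnforceTdd() {
  // Start span for agent task
  const span = tracer.startSpan('enforce-tdd', {
    taskId: 'task-123',
    agentId: 'tdd-enforcer-001',
    projectId: 'llm-platform'
  })

  try {
    // Execute agent task (enforceTDD is the agent's own function, not part of the package)
    const result = await enforceTDD()

    span.setAttribute('tests.created', result.testsCreated)
    span.setAttribute('coverage.percentage', result.coverage)
    span.setStatus({ code: SpanStatusCode.OK })

    return result
  } catch (error) {
    // Record the failure on the span, then rethrow so callers still see it
    span.recordException(error)
    span.setStatus({ code: SpanStatusCode.ERROR })
    throw error
  } finally {
    // Always end the span, on success or failure
    span.end()
  }
}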
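The try/catch/finally pattern above tends to repeat for every traced task, so it can be factored into a small helper. The sketch below is one way to do that against a minimal span interface. `SpanLike` and `withSpan` are hypothetical names for this example, and the string status codes stand in for whatever status type the real span uses.

```typescript
// Minimal span surface needed by the pattern (hypothetical interface).
interface SpanLike {
  setStatus(status: { code: "OK" | "ERROR" }): void;
  recordException(error: unknown): void;
  end(): void;
}

// Run a task inside a span: mark OK on success, record and rethrow on
// failure, and always end the span.
async function withSpan<T>(span: SpanLike, task: () => Promise<T>): Promise<T> {
  try {
    const result = await task();
    span.setStatus({ code: "OK" });
    return result;
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: "ERROR" });
    throw error;
  } finally {
    span.end();
  }
}
```

Because the helper rethrows, callers keep their normal error handling; the span only observes the outcome.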

Trace LLM Calls

import { PhoenixTracer } from '@bluefly/agent-tracer/integrations/phoenix'

const phoenix = new PhoenixTracer({
  endpoint: 'http://localhost:6006',
  project: 'llm-agents'
})

// Trace Claude API call
const result = await phoenix.traceLLMCall({
  model: 'claude-sonnet-4-5-20250929',
  prompt: 'Generate tests for AuthService.ts',
  provider: 'anthropic',
  metadata: {
    agentId: 'tdd-enforcer-001',
    taskType: 'test-generation'
  }
})

console.log('Tokens:', result.usage.totalTokens)
console.log('Cost:', result.cost)
console.log('Latency:', result.latency)

Record Metrics

import { metrics } from '@bluefly/agent-tracer'

// Counter
metrics.counter('agent.tasks.completed', {
  agentId: 'api-builder',
  status: 'success'
})

// Gauge
metrics.gauge('agent.queue.size', 42, {
  agentId: 'doc-sync'
})

// Histogram
metrics.histogram('agent.task.duration', 1250, {
  agentId: 'security-audit',
  taskType: 'vulnerability-scan'
})
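The metric names above use dots, while classic Prometheus metric names are limited to letters, digits, underscores, and colons, and may not start with a digit. The document does not state how Agent Tracer maps one to the other; the sketch below shows the usual sanitizing approach, with `toPrometheusName` as a hypothetical helper.

```typescript
// Convert a dotted metric name into a classic Prometheus-compatible name.
// Any character outside [a-zA-Z0-9_:] becomes an underscore, and a leading
// digit is prefixed with an underscore.
function toPrometheusName(name: string): string {
  const sanitized = name.replace(/[^a-zA-Z0-9_:]/g, "_");
  return /^[0-9]/.test(sanitized) ? `_${sanitized}` : sanitized;
}

// Example: "agent.tasks.completed" becomes "agent_tasks_completed".
const exampleName = toPrometheusName("agent.tasks.completed");
```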

CloudEvents for Activity Streams

OSSA activity streams follow the CloudEvents specification for interoperability with event-driven systems (SigNoz, Kafka, EventBridge).

import { CloudEventEmitter } from '@bluefly/agent-tracer/cloudevents'

const emitter = new CloudEventEmitter({
  source: '/agents/review-agent/instance-123',
  type_prefix: 'io.ossa.agent'
})

// Emit interaction completed event
await emitter.emit({
  type: 'io.ossa.agent.interaction.completed',
  data: {
    agent_id: 'review-agent',
    instance_id: 'uuid-instance',
    session_id: 'uuid-session',
    interaction_id: 'uuid-interaction',
    model: 'claude-sonnet-4-20250514',
    input_tokens: 1523,
    output_tokens: 892,
    latency_ms: 1250,
    finish_reason: 'stop',
    capabilities_used: ['code_review', 'security_analysis']
  }
})

CloudEvents Wire Format:

{
  "specversion": "1.0",
  "type": "io.ossa.agent.interaction.completed",
  "source": "/agents/review-agent/instance-123",
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "time": "2025-12-04T10:30:00.000Z",
  "datacontenttype": "application/json",
  "subject": "session/uuid-session/interaction/uuid-interaction",
  "data": {
    "agent_id": "review-agent",
    "instance_id": "uuid-instance",
    "session_id": "uuid-session",
    "interaction_id": "uuid-interaction",
    "model": "claude-sonnet-4-20250514",
    "input_tokens": 1523,
    "output_tokens": 892,
    "latency_ms": 1250,
    "finish_reason": "stop",
    "capabilities_used": ["code_review", "security_analysis"]
  }
}
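To show how the emitted payload relates to the wire format, here is a minimal sketch that wraps interaction data in a CloudEvents 1.0 envelope. `toCloudEvent` and the two interfaces are hypothetical and are not the `CloudEventEmitter` implementation; the `subject` layout is copied from the example above, and the caller supplies the event `id`.

```typescript
// Fields the envelope needs from the payload to build the subject.
interface InteractionData {
  session_id: string;
  interaction_id: string;
}

// CloudEvents 1.0 envelope with a JSON payload.
interface CloudEvent<T> {
  specversion: "1.0";
  type: string;
  source: string;
  id: string;
  time: string;
  datacontenttype: "application/json";
  subject: string;
  data: T;
}

// Wrap payload data in an envelope. The id is passed in so the caller
// controls uniqueness (for example, a UUID).
function toCloudEvent<T extends InteractionData>(
  source: string,
  type: string,
  data: T,
  id: string,
  now: Date = new Date()
): CloudEvent<T> {
  return {
    specversion: "1.0",
    type,
    source,
    id,
    time: now.toISOString(),
    datacontenttype: "application/json",
    subject: `session/${data.session_id}/interaction/${data.interaction_id}`,
    data,
  };
}
```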

Standard Event Types:

Event Type	Trigger
`io.ossa.agent.started`	Agent instance initialized
`io.ossa.agent.session.created`	New conversation session
`io.ossa.agent.interaction.started`	Prompt received
`io.ossa.agent.interaction.completed`	Response generated
`io.ossa.agent.interaction.failed`	Error during generation
`io.ossa.agent.capability.invoked`	Tool/capability called
`io.ossa.agent.handoff.initiated`	Agent-to-agent delegation
`io.ossa.agent.stopped`	Agent instance terminated

Metrics Cardinality Controls

Critical for Prometheus: keep high-cardinality attributes out of metric labels to prevent label explosion.

import { AgentTracer, CardinalityConfig } from '@bluefly/agent-tracer'

const tracer = new AgentTracer({
  cardinality: {
    // Allowed labels (low cardinality)
    allowed_labels: [
      'agent_name',      // ~50 unique values
      'capability_name', // ~100 unique values
      'status',          // success | error | timeout
      'model',           // ~10 unique values
      'environment'      // dev | staging | production
    ],

    // Forbidden labels (high cardinality - use traces instead)
    forbidden_labels: [
      'instance_id',     // Thousands of values
      'session_id',      // Millions of values
      'interaction_id',  // Billions of values
      'user_id',         // High cardinality
      'request_id'       // High cardinality
    ],

    // Transform high-cardinality to buckets
    bucket_transforms: {
      'latency_ms': [50, 100, 250, 500, 1000, 2500, 5000],
      'token_count': [100, 500, 1000, 5000, 10000, 50000]
    }
  }
})

Rule: Use metrics for aggregates (agent_name, status), traces for specifics (instance_id, session_id).
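The rule above can be sketched as two small functions: one that keeps only allowed labels, and one that maps a raw value to a bucket. Both `filterLabels` and `bucketLabel` are hypothetical helpers for illustration, not the package's implementation, and the `le_` bucket naming is an assumption.

```typescript
type Labels = Record<string, string>;

// Keep only labels on the allow-list; everything else is dropped from the
// metric and should be carried on the trace instead.
function filterLabels(labels: Labels, allowed: readonly string[]): Labels {
  const kept: Labels = {};
  for (const key of allowed) {
    const value = labels[key];
    if (value !== undefined) kept[key] = value;
  }
  return kept;
}

// Map a raw value to the first bucket whose upper bound is >= the value.
// Bounds must be sorted in ascending order.
function bucketLabel(value: number, bounds: readonly number[]): string {
  for (const bound of bounds) {
    if (value <= bound) return `le_${bound}`;
  }
  return "le_inf";
}
```

With the latency bounds from the configuration above, a 1250 ms call lands in the 2500 bucket, so the metric gains one of eight possible label values rather than one per distinct latency.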

CLI Commands

Start Services

# Start ACE server
agent-tracer ace start
# ACE server running on http://localhost:3008

# Start ATLAS server
agent-tracer atlas start
# ATLAS server running on http://localhost:3009

# Start main tracer
agent-tracer start
# Tracer running on http://localhost:3007

View Traces

# List recent traces
agent-tracer traces list --limit 10

# Get specific trace
agent-tracer traces get --trace-id abc123

# Export traces
agent-tracer traces export --output traces.json

# Search traces
agent-tracer traces search --service tdd-enforcer --status error

ACE Commands

# Score agent performance
agent-tracer ace score --agent-id tdd-enforcer-001

# Benchmark multiple agents
agent-tracer ace benchmark --agents tdd-enforcer,api-builder,doc-sync

# View capabilities
agent-tracer ace capabilities --agent-id tdd-enforcer-001

# Generate performance report
agent-tracer ace report --agent-id tdd-enforcer-001 --output report.html

ATLAS Commands

# Analyze agent performance
agent-tracer atlas analyze --agent-id tdd-enforcer-001

# Optimize workflow
agent-tracer atlas optimize --workflow-id tdd-workflow-v1

# View historical trends
agent-tracer atlas trends --agent-id tdd-enforcer-001 --days 30

Correlation Analysis

# Find correlations for trace
agent-tracer correlate --trace-id abc123

# Root cause analysis
agent-tracer rca --incident-id incident-789

# Impact analysis
agent-tracer impact --service tdd-enforcer

ACE (AI Capabilities Engine)

Performance Scoring

ACE scores agent performance across multiple dimensions:

$ agent-tracer ace score --agent-id tdd-enforcer-001

ACE Performance Score


Agent: tdd-enforcer-001
Period: Last 24 hours

Overall Score: 88/100 

Component Scores:
  Quality: 92/100 
    - Test coverage: 95%
    - Test quality: 90%
    - TDD compliance: 91%

  Efficiency: 85/100 
    - Task completion: 95%
    - Average latency: 1.2s
    - Token usage: 85k (optimal)

  Reliability: 87/100 
    - Success rate: 98%
    - Error rate: 2%
    - Uptime: 99.9%

Recommendations:
  - Reduce token usage on simple tasks (-10%)
  - Improve error handling for edge cases
  - Cache common test patterns

Capability Matrix

$ agent-tracer ace capabilities --agent-id tdd-enforcer-001

Agent Capabilities


Capability              | Level | Confidence
------------------------|-------|------------
Test Generation         | 95%   | High
Coverage Analysis       | 92%   | High
TDD Enforcement         | 88%   | Medium
Code Quality Check      | 85%   | Medium
Security Validation     | 78%   | Medium
Performance Testing     | 65%   | Low

Strengths:
  Excellent at generating comprehensive test suites
  High accuracy in coverage analysis
  Strong TDD compliance enforcement

Areas for Improvement:
  Security test generation needs work
  Performance test coverage is low

ATLAS (Agent Tracing & Learning Analytics)

Learning Analytics

$ agent-tracer atlas analyze --agent-id tdd-enforcer-001

ATLAS Learning Analytics


Agent: tdd-enforcer-001
Analysis Period: 30 days

Learning Progress:
  Task Success Rate: 87% → 98% (+11%)
  Average Quality Score: 75 → 92 (+17)
  Token Efficiency: 120k → 85k (-35k)
  Latency: 2.1s → 1.2s (-0.9s)

Key Learnings:
  Improved test pattern recognition
  Better context understanding
  Optimized prompt engineering
  Enhanced error recovery

Optimization Opportunities:
  - Cache frequently used test templates
  - Pre-process common file patterns
  - Batch similar tasks for efficiency

Workflow Optimization

$ agent-tracer atlas optimize --workflow-id tdd-workflow-v1

Workflow Optimization Report


Workflow: tdd-workflow-v1
Agents: tdd-enforcer, api-builder, doc-sync

Bottlenecks Identified:
  1. TDD Enforcer → API Builder handoff (2.3s avg)
  2. API Builder test validation (1.8s avg)
  3. Doc Sync git operations (1.2s avg)

Recommendations:
  Parallelize TDD enforcement and doc sync
  Cache API Builder validation results
  Use git worktrees for faster operations

Estimated Improvement: -3.5s (-45%)

Dashboards

Pre-built Grafana Dashboards

Located in dashboard/:

Agent Overview - High-level metrics for all agents
LLM Performance - Model usage, costs, token tracking
Trace Analysis - Distributed trace visualization
ACE Scores - Agent capability scores over time
ATLAS Analytics - Learning progress and optimization
Infrastructure - System health and resource usage

Import Dashboards

# Import a single dashboard (repeat for each file in dashboard/)
curl -X POST http://localhost:3000/api/dashboards/import \
  -H "Content-Type: application/json" \
  -d @dashboard/agent-overview.json

# Or via CLI
agent-tracer dashboards import --all

Alerting

Alert Rules

Pre-configured in infrastructure/prometheus/alerts.yml:

groups:
  - name: agent_alerts
    rules:
      - alert: HighAgentErrorRate
        expr: rate(agent_errors_total[5m]) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High error rate for {{ $labels.agent_id }}"

      - alert: HighLLMCost
        expr: sum(llm_cost_usd) > 100
        for: 1d
        labels:
          severity: critical
        annotations:
          summary: "Daily LLM cost exceeded $100"

      - alert: LowAgentQualityScore
        expr: agent_quality_score < 0.7
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Agent {{ $labels.agent_id }} quality score below threshold"
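The `for` clause in these rules means the expression must stay true across consecutive evaluations for the whole duration before the alert fires; a single dip resets the timer. The sketch below simulates that behaviour over a list of samples. `firesAt` and `Sample` are hypothetical names for this illustration and are not part of Prometheus or Agent Tracer.

```typescript
interface Sample {
  timeMs: number;
  value: number;
}

// Return the timestamp at which the alert would start firing, or null.
// Samples must be sorted by time; each one stands for a rule evaluation.
function firesAt(
  samples: readonly Sample[],
  threshold: number,
  forMs: number
): number | null {
  let pendingSince: number | null = null;
  for (const sample of samples) {
    if (sample.value > threshold) {
      if (pendingSince === null) pendingSince = sample.timeMs;
      if (sample.timeMs - pendingSince >= forMs) return sample.timeMs;
    } else {
      pendingSince = null; // condition broke, so the pending timer resets
    }
  }
  return null;
}
```

For HighAgentErrorRate, this means a brief spike above 0.05 that recovers within ten minutes never pages anyone.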

Configure Alertmanager

# alertmanager.yml
global:
  slack_api_url: 'https://hooks.slack.com/services/...'

route:
  group_by: ['alertname', 'agent_id']
  receiver: 'slack-notifications'

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#agent-alerts'
        title: 'Agent Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

Integration with BuildKit

Automatic Instrumentation

# BuildKit automatically instruments agents
buildkit agents deploy tdd-enforcer --with-tracing

# View agent traces
buildkit agents traces tdd-enforcer-001

# View agent metrics
buildkit agents metrics tdd-enforcer-001

# ACE score
buildkit agents score tdd-enforcer-001

API Reference

Full API documentation: OpenAPI Spec

Key Endpoints

Tracing

POST /api/v1/traces - Submit trace
GET /api/v1/traces/:id - Get trace
GET /api/v1/traces/search - Search traces

ACE

POST /api/v1/ace/score - Score agent
GET /api/v1/ace/benchmarks - List benchmarks
POST /api/v1/ace/capabilities - Get capabilities

ATLAS

GET /api/v1/atlas/analytics/:agentId - Get analytics
POST /api/v1/atlas/optimize - Optimize workflow
GET /api/v1/atlas/trends - Historical trends

Metrics

GET /metrics - Prometheus metrics
POST /api/v1/metrics/custom - Submit custom metric
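As an example of calling one of these endpoints, the sketch below builds the path for a trace search. The query parameter names (`service`, `status`) are assumptions that mirror the CLI flags shown earlier; the OpenAPI spec is the authority on the real names. `traceSearchPath` is a hypothetical helper.

```typescript
// Build the path and query string for GET /api/v1/traces/search.
// Keys are sorted so the output is stable; values are URL-encoded.
function traceSearchPath(filters: Record<string, string>): string {
  const query = Object.keys(filters)
    .sort()
    .map((key) => `${encodeURIComponent(key)}=${encodeURIComponent(filters[key] ?? "")}`)
    .join("&");
  return query === "" ? "/api/v1/traces/search" : `/api/v1/traces/search?${query}`;
}

// Combine with the tracer's base URL, for example http://localhost:3007.
const searchPath = traceSearchPath({ service: "tdd-enforcer", status: "error" });
```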
