Phoenix Arize - AI Observability

Phoenix Arize provides AI-specific observability for LLM calls, prompt analysis, and AI agent performance tracking. It's the primary tool for monitoring Claude API usage and AI operations.

Overview

Phoenix is purpose-built for AI/ML observability:

LLM Tracing: Track every Claude API call
Cost Tracking: Real-time cost monitoring
Prompt Analysis: Input/output analysis and optimization
Quality Metrics: Latency, throughput, quality scores
Visualization: Interactive trace explorer and flamegraphs

Installation

Docker (Recommended)

# Pull and run Phoenix
docker run -d \
  --name phoenix \
  -p 6006:6006 \
  -p 4317:4317 \
  arizephoenix/phoenix:latest

# Verify
curl http://localhost:6006/health

Docker Compose

# docker-compose.yml
version: '3.8'

services:
  phoenix:
    image: arizephoenix/phoenix:latest
    ports:
      - "6006:6006"   # UI
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
    environment:
      - PHOENIX_WORKING_DIR=/phoenix-data
    volumes:
      - phoenix-data:/phoenix-data
    restart: unless-stopped

volumes:
  phoenix-data:

Kubernetes

# phoenix-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: phoenix
  namespace: observability
spec:
  replicas: 1
  selector:
    matchLabels:
      app: phoenix
  template:
    metadata:
      labels:
        app: phoenix
    spec:
      containers:
      - name: phoenix
        image: arizephoenix/phoenix:latest
        ports:
        - containerPort: 6006
          name: ui
        - containerPort: 4317
          name: otlp-grpc
        - containerPort: 4318
          name: otlp-http
---
apiVersion: v1
kind: Service
metadata:
  name: phoenix
  namespace: observability
spec:
  selector:
    app: phoenix
  ports:
  - name: ui
    port: 6006
    targetPort: 6006
  - name: otlp-grpc
    port: 4317
    targetPort: 4317
  - name: otlp-http
    port: 4318
    targetPort: 4318

Configuration

Environment Variables

# Phoenix Configuration
PHOENIX_COLLECTOR_ENDPOINT=http://localhost:6006
PHOENIX_PROJECT=llm-agents
PHOENIX_OTLP_ENDPOINT=http://localhost:4317

# Optional
PHOENIX_WORKING_DIR=/data/phoenix
PHOENIX_PORT=6006
PHOENIX_GRPC_PORT=4317
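
The variables above are typically read once at startup. As a minimal sketch of that step, the following loader applies the defaults listed in this section and rejects invalid port values. `PhoenixConfig`, `parsePort`, and `loadPhoenixConfig` are hypothetical names used for illustration; they are not part of Phoenix or Agent Tracer.

```typescript
// Hypothetical config shape; field names mirror the variables listed above.
interface PhoenixConfig {
  collectorEndpoint: string
  project: string
  otlpEndpoint: string
  port: number
  grpcPort: number
}

// Parses a port value, falling back to a default when the variable is unset.
function parsePort(raw: string | undefined, fallback: number, name: string): number {
  if (raw === undefined || raw === '') return fallback
  const port = Number(raw)
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`${name} must be an integer between 1 and 65535, got "${raw}"`)
  }
  return port
}

// Builds a config from an environment-like record, using the defaults shown above.
function loadPhoenixConfig(env: Record<string, string | undefined>): PhoenixConfig {
  return {
    collectorEndpoint: env.PHOENIX_COLLECTOR_ENDPOINT ?? 'http://localhost:6006',
    project: env.PHOENIX_PROJECT ?? 'llm-agents',
    otlpEndpoint: env.PHOENIX_OTLP_ENDPOINT ?? 'http://localhost:4317',
    port: parsePort(env.PHOENIX_PORT, 6006, 'PHOENIX_PORT'),
    grpcPort: parsePort(env.PHOENIX_GRPC_PORT, 4317, 'PHOENIX_GRPC_PORT')
  }
}
```

In a Node.js service this would be called as `loadPhoenixConfig(process.env)`. Taking the environment as a parameter keeps the function easy to test.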

Agent Tracer Integration

# config/tracer.yaml
exporters:
  phoenix:
    enabled: true
    endpoint: http://localhost:6006
    project: llm-agents
    otlpEndpoint: http://localhost:4317

Usage

Access Phoenix UI

# Open browser
open http://localhost:6006

# Or via kubectl port-forward
kubectl port-forward -n observability svc/phoenix 6006:6006
open http://localhost:6006

Initialize Phoenix Tracer

import { PhoenixTracer } from '@bluefly/agent-tracer/integrations/phoenix'

const phoenix = new PhoenixTracer({
  endpoint: 'http://localhost:6006',
  project: 'llm-agents'
})

Trace LLM Calls

// Trace Claude API call
const result = await phoenix.traceLLMCall({
  model: 'claude-sonnet-4-5-20250929',
  prompt: `Generate comprehensive tests for the following code:

  ${code}`,
  provider: 'anthropic',
  temperature: 0.3,
  maxTokens: 4096,
  metadata: {
    agentId: 'tdd-enforcer-001',
    taskId: 'task-123',
    projectId: 'llm-platform',
    userId: 'user-456'
  }
})

// Trace data captured:
console.log('Trace ID:', result.traceId)
console.log('Span ID:', result.spanId)
console.log('Input Tokens:', result.usage.inputTokens)
console.log('Output Tokens:', result.usage.outputTokens)
console.log('Total Tokens:', result.usage.totalTokens)
console.log('Cost (USD):', result.cost)
console.log('Latency (ms):', result.latency)
console.log('Response:', result.response)

Manual Tracing

import { trace, SpanStatusCode } from '@opentelemetry/api'

const tracer = trace.getTracer('agent-tracer')
const model = 'claude-sonnet-4-5-20250929'

// callClaudeAPI and calculateCost are application helpers defined elsewhere
async function tracedCompletion(prompt: string) {
  const span = tracer.startSpan('llm.completion', {
    attributes: {
      'llm.provider': 'anthropic',
      'llm.model': model,
      'llm.temperature': 0.3,
      'llm.max_tokens': 4096
    }
  })
  const startedAt = Date.now()

  try {
    const response = await callClaudeAPI(prompt)
    const { input_tokens, output_tokens } = response.usage

    span.setAttributes({
      'llm.input_tokens': input_tokens,
      'llm.output_tokens': output_tokens,
      'llm.total_tokens': input_tokens + output_tokens,
      'llm.cost_usd': calculateCost(model, {
        inputTokens: input_tokens,
        outputTokens: output_tokens
      }),
      'llm.response_time_ms': Date.now() - startedAt
    })

    span.setStatus({ code: SpanStatusCode.OK })
    return response
  } catch (error) {
    // Record the failure on the span before rethrowing
    span.recordException(error as Error)
    span.setStatus({ code: SpanStatusCode.ERROR })
    throw error
  } finally {
    span.end()
  }
}

Phoenix UI Features

1. Trace Explorer

View all LLM calls with detailed information:

Request/response payloads
Token usage breakdown
Cost per call
Latency metrics
Error rates

2. Prompt Analysis

Analyze and optimize prompts:

Input/output comparison
Token efficiency
Cost optimization suggestions
Prompt versioning
A/B testing results

3. Model Comparison

Compare different models:

Model                      | Avg Latency | Avg Cost | Success Rate
---------------------------|-------------|----------|-------------
claude-sonnet-4-5-20250929 | 1.2s        | $0.08    | 98%
claude-opus-4-20250514     | 2.5s        | $0.24    | 99%
gpt-4-turbo                | 1.8s        | $0.12    | 97%

4. Cost Dashboard

Track AI spending:

Daily/weekly/monthly costs
Cost by agent
Cost by task type
Cost trends
Budget alerts

5. Performance Metrics

Monitor LLM performance:

P50, P95, P99 latency
Throughput (requests/sec)
Error rates
Token usage patterns
Cache hit rates
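
To make the latency and throughput figures concrete, here is a minimal sketch of how P50/P95/P99 and requests per second can be derived from raw latency samples. It uses the nearest-rank percentile definition, which is one common choice; the source does not state which method Phoenix uses. `percentile` and `summarizeLatency` are hypothetical helper names.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples')
  if (p <= 0 || p > 100) throw new Error('p must be in (0, 100]')
  const sorted = [...samples].sort((a, b) => a - b)
  const rank = Math.ceil((p * sorted.length) / 100)
  return sorted[rank - 1]
}

// Summarizes latency samples (milliseconds) collected over windowSeconds.
function summarizeLatency(latenciesMs: number[], windowSeconds: number) {
  return {
    p50: percentile(latenciesMs, 50),
    p95: percentile(latenciesMs, 95),
    p99: percentile(latenciesMs, 99),
    throughputPerSec: latenciesMs.length / windowSeconds
  }
}
```

Tail percentiles such as P99 are only meaningful with enough samples; with fewer than 100 samples, nearest-rank P99 is simply the maximum.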

LLM Metrics

Tracked Metrics

Phoenix automatically tracks:

{
  // Model Information
  'llm.provider': 'anthropic',
  'llm.model': 'claude-sonnet-4-5-20250929',
  'llm.temperature': 0.3,
  'llm.max_tokens': 4096,

  // Usage Metrics
  'llm.input_tokens': 1250,
  'llm.output_tokens': 850,
  'llm.total_tokens': 2100,

  // Cost Metrics
  'llm.input_cost_usd': 0.00375,   // $3/M tokens
  'llm.output_cost_usd': 0.01275,  // $15/M tokens
  'llm.total_cost_usd': 0.0165,

  // Performance Metrics
  'llm.latency_ms': 1234,
  'llm.ttft_ms': 245,              // Time to first token
  'llm.tokens_per_second': 688,

  // Quality Metrics
  'llm.finish_reason': 'stop',
  'llm.stop_reason': null,
  'llm.error': null,

  // Context
  'agent.id': 'tdd-enforcer-001',
  'task.id': 'task-123',
  'project.id': 'llm-platform',
  'user.id': 'user-456'
}
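
Two of the fields in the sample are derived from the others. The sketch below recomputes them from the raw values. It assumes `llm.tokens_per_second` is output tokens divided by total latency, rounded down, which reproduces the 688 in the sample; the source does not state the formula, so treat that as an assumption. `RawCall` and `deriveUsageMetrics` are hypothetical names.

```typescript
interface RawCall {
  inputTokens: number
  outputTokens: number
  latencyMs: number
}

// Recomputes the derived usage fields from raw measurements.
function deriveUsageMetrics(call: RawCall) {
  if (call.latencyMs <= 0) throw new Error('latencyMs must be positive')
  return {
    'llm.total_tokens': call.inputTokens + call.outputTokens,
    // Assumed definition: output tokens per second of total latency, rounded down.
    'llm.tokens_per_second': Math.floor(call.outputTokens / (call.latencyMs / 1000))
  }
}
```

With the sample values (1250 input tokens, 850 output tokens, 1234 ms), this yields 2100 total tokens and 688 tokens per second.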

Cost Tracking

Cost Calculation

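// Cost per model, in USD per million tokens.
// Prices change over time; verify against current provider pricing.
interface TokenUsage {
  inputTokens: number
  outputTokens: number
}

const MODEL_COSTS: Record<string, { input: number; output: number }> = {
  'claude-sonnet-4-5-20250929': {
    input: 3.00,   // $3.00 per million tokens
    output: 15.00  // $15.00 per million tokens
  },
  'claude-opus-4-20250514': {
    input: 15.00,
    output: 75.00
  },
  'gpt-4-turbo': {
    input: 10.00,
    output: 30.00
  }
}

function calculateCost(model: string, usage: TokenUsage): number {
  const costs = MODEL_COSTS[model]
  if (!costs) {
    // Fail loudly rather than silently reporting a zero cost
    throw new Error(`No pricing configured for model: ${model}`)
  }
  const inputCost = (usage.inputTokens / 1_000_000) * costs.input
  const outputCost = (usage.outputTokens / 1_000_000) * costs.output
  return inputCost + outputCost
}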

Daily Cost Report

$ agent-tracer phoenix cost-report --days 1

LLM Cost Report - Last 24 Hours

Total Cost: $45.23

By Model:
  claude-sonnet-4-5-20250929:  $32.10 (71%)
  claude-opus-4-20250514:      $10.45 (23%)
  gpt-4-turbo:                 $2.68  (6%)

By Agent:
  tdd-enforcer-001:   $18.50 (41%)
  api-builder-002:    $12.30 (27%)
  doc-sync-003:       $8.20  (18%)
  security-audit-004: $6.23  (14%)

By Task Type:
  test-generation:    $22.15 (49%)
  code-review:        $12.80 (28%)
  documentation:      $6.50  (14%)
  security-scan:      $3.78  (8%)

Token Usage:
  Input:  1.2M tokens
  Output: 850k tokens
  Total:  2.05M tokens

Recommendations:
  - Consider caching common test patterns
  - Use Sonnet instead of Opus for simple tasks
  - Token efficiency improved 15% vs yesterday
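
The "By Model" and "By Agent" breakdowns in the report are grouped sums with a percentage of the total. A minimal sketch of that aggregation follows. `CostRecord`, `CostShare`, and `summarizeBy` are hypothetical names, and this is not the implementation behind the `agent-tracer` CLI.

```typescript
interface CostRecord {
  model: string
  agentId: string
  costUsd: number
}

interface CostShare {
  key: string
  costUsd: number
  percent: number // share of the grand total, rounded to a whole percent
}

// Groups records by the chosen field and returns totals, highest cost first.
function summarizeBy(records: CostRecord[], field: 'model' | 'agentId'): CostShare[] {
  const totals = new Map<string, number>()
  let grandTotal = 0
  for (const record of records) {
    totals.set(record[field], (totals.get(record[field]) ?? 0) + record.costUsd)
    grandTotal += record.costUsd
  }
  return Array.from(totals, ([key, costUsd]) => ({
    key,
    costUsd,
    percent: grandTotal === 0 ? 0 : Math.round((costUsd / grandTotal) * 100)
  })).sort((a, b) => b.costUsd - a.costUsd)
}
```

Because each share is rounded independently, the percentages may not sum to exactly 100, as in the "By Task Type" section of the report above (49 + 28 + 14 + 8 = 99).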

Prompt Optimization

Analyzing Prompts

Phoenix tracks prompt performance:

// Track prompt versions
const promptV1 = {
  template: 'Generate tests for: {code}',
  avgTokens: 2500,
  avgCost: 0.025,
  avgLatency: 1800,
  successRate: 0.85
}

const promptV2 = {
  template: 'As a senior test engineer, create comprehensive unit tests for the following TypeScript code:\n\n{code}\n\nInclude: edge cases, error handling, mocks',
  avgTokens: 3200,
  avgCost: 0.032,
  avgLatency: 2100,
  successRate: 0.97
}

// Compare in Phoenix UI
// Result: V2 costs 28% more per call but raises the success rate by 12 percentage points (85% -> 97%)
// Decision: Use V2 for critical code, V1 for simple utilities
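
The comparison can also be computed directly from the recorded averages rather than read off the UI. The sketch below derives relative changes for tokens, cost, and latency, and reports the success rate as a difference in percentage points. `PromptStats`, `relativeChangePct`, and `comparePrompts` are hypothetical names for illustration.

```typescript
interface PromptStats {
  avgTokens: number
  avgCost: number     // USD per call
  avgLatency: number  // milliseconds
  successRate: number // fraction between 0 and 1
}

// Relative change from base to candidate, as a whole-number percentage.
function relativeChangePct(base: number, candidate: number): number {
  return Math.round(((candidate - base) / base) * 100)
}

function comparePrompts(base: PromptStats, candidate: PromptStats) {
  return {
    tokensPct: relativeChangePct(base.avgTokens, candidate.avgTokens),
    costPct: relativeChangePct(base.avgCost, candidate.avgCost),
    latencyPct: relativeChangePct(base.avgLatency, candidate.avgLatency),
    // Success rate is compared in percentage points, not as a ratio.
    successPoints: Math.round((candidate.successRate - base.successRate) * 100)
  }
}
```

For the V1 and V2 figures above, this gives +28% tokens, +28% cost, +17% latency, and +12 percentage points of success rate.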

Token Optimization

// BAD - Wasteful prompt
const badPrompt = `
You are an expert senior principal staff engineer with 20 years of experience
in test-driven development, clean code, SOLID principles, design patterns,
and software architecture. Your task is to carefully analyze the following
code and generate comprehensive, well-structured, maintainable unit tests...

[Long preamble continues...]

Here is the code to test:
${code}
`

// GOOD - Efficient prompt
const goodPrompt = `Generate comprehensive unit tests for:

${code}

Include: edge cases, error handling, mocks.
`

// Token savings: ~200 input tokens per call
// Cost savings: ~$0.0006 per call at $3/M input tokens (Sonnet), ~$0.003 at $15/M (Opus)
// Daily savings (1000 calls): ~$0.60 (Sonnet) or ~$3.00 (Opus)
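
How much a shorter prompt saves depends on the input-token price of the model in use. A minimal sketch of the arithmetic, with prices expressed in USD per million input tokens as in the cost table on this page; `dailySavingsUsd` is a hypothetical helper name.

```typescript
// Estimated daily savings in USD from trimming input tokens off every call.
function dailySavingsUsd(
  tokensSavedPerCall: number,
  inputPricePerMillionUsd: number,
  callsPerDay: number
): number {
  const savingsPerCall = (tokensSavedPerCall / 1_000_000) * inputPricePerMillionUsd
  return savingsPerCall * callsPerDay
}
```

For 200 tokens saved across 1000 calls per day, this gives about $0.60 per day at $3 per million input tokens and about $3.00 per day at $15 per million.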

Integration with Agent Tracer

Automatic Phoenix Integration

Agent Tracer automatically sends traces to Phoenix:

// No manual integration needed!
// Just configure endpoints in environment

const tracer = new AgentTracer({
  serviceName: 'tdd-enforcer',
  phoenixEndpoint: process.env.PHOENIX_COLLECTOR_ENDPOINT
})

// All LLM calls automatically traced to Phoenix
const result = await callClaude(prompt)

View Agent Traces in Phoenix

# Filter by agent
http://localhost:6006/traces?filter=agent_id:tdd-enforcer-001

# Filter by cost
http://localhost:6006/traces?filter=cost_usd:>0.10

# Filter by latency
http://localhost:6006/traces?filter=latency_ms:>2000

# Filter by error
http://localhost:6006/traces?filter=error:true
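
When these links are generated in code or pasted into a shell, characters such as `:` and `>` are safer percent-encoded. The sketch below builds such a URL. It assumes the `filter` query parameter syntax shown above and that the UI decodes percent-encoded values; `traceFilterUrl` is a hypothetical helper name.

```typescript
// Builds a trace explorer URL with the filter expression percent-encoded.
function traceFilterUrl(baseUrl: string, field: string, expr: string): string {
  const filter = encodeURIComponent(`${field}:${expr}`)
  // Strip trailing slashes so the path is not doubled up.
  return `${baseUrl.replace(/\/+$/, '')}/traces?filter=${filter}`
}
```

For example, `traceFilterUrl('http://localhost:6006', 'cost_usd', '>0.10')` returns `http://localhost:6006/traces?filter=cost_usd%3A%3E0.10`.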

Alerting

Cost Alerts

# phoenix-alerts.yml
alerts:
  - name: high_daily_cost
    condition: daily_cost_usd > 100
    notification:
      slack: '#agent-alerts'
      email: 'team@example.com'

  - name: high_single_call_cost
    condition: call_cost_usd > 1.00
    notification:
      slack: '#agent-alerts'

Performance Alerts

alerts:
  - name: high_latency
    condition: p95_latency_ms > 5000
    notification:
      slack: '#agent-alerts'

  - name: high_error_rate
    condition: error_rate > 0.05
    notification:
      pagerduty: true
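
Each `condition` in the alert files above is a simple threshold comparison. The sketch below shows one way such a rule could be evaluated against current metric values. It handles only the `<metric> <operator> <number>` form used in these examples and is not the actual alert engine; `AlertRule`, `comparators`, and `isTriggered` are hypothetical names.

```typescript
interface AlertRule {
  name: string
  condition: string // e.g. "daily_cost_usd > 100"
}

const comparators: Record<string, (value: number, threshold: number) => boolean> = {
  '>': (v, t) => v > t,
  '>=': (v, t) => v >= t,
  '<': (v, t) => v < t,
  '<=': (v, t) => v <= t
}

// Parses "<metric> <operator> <number>" and checks it against current values.
function isTriggered(rule: AlertRule, metrics: Record<string, number | undefined>): boolean {
  const match = /^(\w+)\s*(>=|<=|>|<)\s*(\d+(?:\.\d+)?)$/.exec(rule.condition.trim())
  if (!match) throw new Error(`Unsupported condition: ${rule.condition}`)
  const [, metric, op, rawThreshold] = match
  const value = metrics[metric]
  if (value === undefined) return false // metric not reported yet
  return comparators[op](value, Number(rawThreshold))
}
```

A missing metric is treated as "not triggered" here; a stricter design might raise a separate alert when an expected metric stops arriving.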

Best Practices

1. Tag All Traces

phoenix.traceLLMCall({
  model: 'claude-sonnet-4-5-20250929',
  prompt: prompt,
  metadata: {
    // Always include
    agentId: 'tdd-enforcer-001',
    taskId: 'task-123',
    projectId: 'llm-platform',

    // Optional but helpful
    userId: 'user-456',
    feature: 'test-generation',
    environment: 'production',
    version: 'v1.2.3'
  }
})

2. Monitor Cost Trends

# Weekly cost review
agent-tracer phoenix cost-report --days 7 --trend

# Set budget alerts
agent-tracer phoenix alert create \
  --type cost \
  --threshold 500 \
  --period daily

3. Optimize High-Cost Tasks

# Find expensive tasks
agent-tracer phoenix analyze --sort-by cost --limit 10

# Result:
# 1. Code Review (complex): $2.50 avg
# 2. Architecture Analysis: $1.80 avg
# 3. Test Generation (full): $0.95 avg

# Optimization: Cache common patterns, use smaller model for simple tasks

4. A/B Test Prompts

// Test prompt versions
const results = await phoenix.abTest({
  variants: [
    { id: 'v1', prompt: promptV1 },
    { id: 'v2', prompt: promptV2 }
  ],
  traffic: { v1: 0.5, v2: 0.5 },
  duration: '7d',
  metrics: ['cost', 'latency', 'success_rate']
})
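
The source does not say how `traffic` weights are applied to individual calls. One common approach is deterministic hash bucketing, so the same task or user always sees the same variant. The sketch below illustrates that technique under this assumption. `fnv1a` and `assignVariant` are hypothetical helpers and do not describe what `phoenix.abTest` does internally.

```typescript
// FNV-1a style 32-bit hash over UTF-16 code units. Used only to spread keys
// evenly across buckets; it is not cryptographic.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193)
  }
  return hash >>> 0
}

// Picks a variant for a key so that the same key always maps to the same variant.
function assignVariant(key: string, traffic: Record<string, number>): string {
  const ids = Object.keys(traffic)
  if (ids.length === 0) throw new Error('no variants configured')
  const totalWeight = ids.reduce((sum, id) => sum + traffic[id], 0)
  // Map the hash to a point in [0, totalWeight).
  const point = (fnv1a(key) / 0x100000000) * totalWeight
  let cumulative = 0
  for (const id of ids) {
    cumulative += traffic[id]
    if (point < cumulative) return id
  }
  return ids[ids.length - 1] // guards against rounding at the upper boundary
}
```

A call such as `assignVariant('task-123', { v1: 0.5, v2: 0.5 })` returns the same variant every time for `task-123`, which keeps retries of one task from mixing prompt versions.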

Troubleshooting

Phoenix Not Receiving Traces

# Check Phoenix is running
curl http://localhost:6006/health

# Check the OTLP gRPC port is open (gRPC is not plain HTTP, so curl cannot probe it)
nc -zv localhost 4317

# Verify environment variables
echo $PHOENIX_COLLECTOR_ENDPOINT

# Check agent tracer logs
docker logs agent-tracer | grep phoenix

High Memory Usage

# Limit Phoenix data retention
docker run -e PHOENIX_DATA_RETENTION_DAYS=7 arizephoenix/phoenix

# Or in docker-compose
environment:
  - PHOENIX_DATA_RETENTION_DAYS=7
  - PHOENIX_MAX_TRACES=100000
