Phoenix Arize - AI Observability
Phoenix Arize provides AI-specific observability for LLM calls, prompt analysis, and AI agent performance tracking. It's the primary tool for monitoring Claude API usage and AI operations.
Overview
Phoenix is purpose-built for AI/ML observability:
- LLM Tracing: Track every Claude API call
- Cost Tracking: Real-time cost monitoring
- Prompt Analysis: Input/output analysis and optimization
- Quality Metrics: Latency, throughput, quality scores
- Visualization: Interactive trace explorer and flamegraphs
Installation
Docker (Recommended)
# Pull and run Phoenix
docker run -d \
  --name phoenix \
  -p 6006:6006 \
  -p 4317:4317 \
  arizephoenix/phoenix:latest

# Verify
curl http://localhost:6006/health
Docker Compose
# docker-compose.yml
version: '3.8'
services:
  phoenix:
    image: arizephoenix/phoenix:latest
    ports:
      - "6006:6006"  # UI
      - "4317:4317"  # OTLP gRPC
      - "4318:4318"  # OTLP HTTP
    environment:
      - PHOENIX_WORKING_DIR=/phoenix-data
    volumes:
      - phoenix-data:/phoenix-data
    restart: unless-stopped
volumes:
  phoenix-data:
Kubernetes
# phoenix-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: phoenix
  namespace: observability
spec:
  replicas: 1
  selector:
    matchLabels:
      app: phoenix
  template:
    metadata:
      labels:
        app: phoenix
    spec:
      containers:
        - name: phoenix
          image: arizephoenix/phoenix:latest
          ports:
            - containerPort: 6006
              name: ui
            - containerPort: 4317
              name: otlp-grpc
            - containerPort: 4318
              name: otlp-http
---
apiVersion: v1
kind: Service
metadata:
  name: phoenix
  namespace: observability
spec:
  selector:
    app: phoenix
  ports:
    - name: ui
      port: 6006
      targetPort: 6006
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
Configuration
Environment Variables
# Phoenix Configuration
PHOENIX_COLLECTOR_ENDPOINT=http://localhost:6006
PHOENIX_PROJECT=llm-agents
PHOENIX_OTLP_ENDPOINT=http://localhost:4317

# Optional
PHOENIX_WORKING_DIR=/data/phoenix
PHOENIX_PORT=6006
PHOENIX_GRPC_PORT=4317
Agent Tracer Integration
# config/tracer.yaml
exporters:
  phoenix:
    enabled: true
    endpoint: http://localhost:6006
    project: llm-agents
    otlpEndpoint: http://localhost:4317
Usage
Access Phoenix UI
# Open browser
open http://localhost:6006

# Or via kubectl port-forward
kubectl port-forward -n observability svc/phoenix 6006:6006
open http://localhost:6006
Initialize Phoenix Tracer
import { PhoenixTracer } from '@bluefly/agent-tracer/integrations/phoenix'

const phoenix = new PhoenixTracer({
  endpoint: 'http://localhost:6006',
  project: 'llm-agents'
})
Trace LLM Calls
// Trace Claude API call
const result = await phoenix.traceLLMCall({
  model: 'claude-sonnet-4-5-20250929',
  prompt: `Generate comprehensive tests for the following code: ${code}`,
  provider: 'anthropic',
  temperature: 0.3,
  maxTokens: 4096,
  metadata: {
    agentId: 'tdd-enforcer-001',
    taskId: 'task-123',
    projectId: 'llm-platform',
    userId: 'user-456'
  }
})

// Trace data captured:
console.log('Trace ID:', result.traceId)
console.log('Span ID:', result.spanId)
console.log('Input Tokens:', result.usage.inputTokens)
console.log('Output Tokens:', result.usage.outputTokens)
console.log('Total Tokens:', result.usage.totalTokens)
console.log('Cost (USD):', result.cost)
console.log('Latency (ms):', result.latency)
console.log('Response:', result.response)
Manual Tracing
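import { trace, SpanStatusCode } from '@opentelemetry/api'

const tracer = trace.getTracer('agent-tracer')

// callClaudeAPI and calculateCost are application-level helpers,
// not part of the OpenTelemetry API.
async function tracedCompletion(prompt: string) {
  const model = 'claude-sonnet-4-5-20250929'
  const span = tracer.startSpan('llm.completion', {
    attributes: {
      'llm.provider': 'anthropic',
      'llm.model': model,
      'llm.temperature': 0.3,
      'llm.max_tokens': 4096
    }
  })
  // Record the start time so the response time can be attached to the span
  const startTime = Date.now()

  try {
    const response = await callClaudeAPI(prompt)
    span.setAttributes({
      'llm.input_tokens': response.usage.input_tokens,
      'llm.output_tokens': response.usage.output_tokens,
      'llm.total_tokens': response.usage.input_tokens + response.usage.output_tokens,
      'llm.cost_usd': calculateCost(model, {
        inputTokens: response.usage.input_tokens,
        outputTokens: response.usage.output_tokens
      }),
      'llm.response_time_ms': Date.now() - startTime
    })
    span.setStatus({ code: SpanStatusCode.OK })
    return response
  } catch (error) {
    span.recordException(error as Error)
    span.setStatus({ code: SpanStatusCode.ERROR })
    throw error
  } finally {
    // Always end the span, whether the call succeeded or failed
    span.end()
  }
}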
Phoenix UI Features
1. Trace Explorer
View all LLM calls with detailed information:
- Request/response payloads
- Token usage breakdown
- Cost per call
- Latency metrics
- Error rates
2. Prompt Analysis
Analyze and optimize prompts:
- Input/output comparison
- Token efficiency
- Cost optimization suggestions
- Prompt versioning
- A/B testing results
3. Model Comparison
Compare different models:
Model                      | Avg Latency | Avg Cost | Success Rate
---------------------------|-------------|----------|-------------
claude-sonnet-4-5-20250929 | 1.2s        | $0.08    | 98%
claude-opus-4-20250514     | 2.5s        | $0.24    | 99%
gpt-4-turbo                | 1.8s        | $0.12    | 97%
4. Cost Dashboard
Track AI spending:
- Daily/weekly/monthly costs
- Cost by agent
- Cost by task type
- Cost trends
- Budget alerts
5. Performance Metrics
Monitor LLM performance:
- P50, P95, P99 latency
- Throughput (requests/sec)
- Error rates
- Token usage patterns
- Cache hit rates
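The latency percentiles in the list above can be reproduced from raw per-call latencies. The following is a minimal sketch using the nearest-rank method; the helper name `percentile` and the sample latencies are illustrative assumptions, and Phoenix may compute its percentiles differently.

```typescript
// Nearest-rank percentile over a list of latencies in milliseconds.
// `percentile` is a hypothetical helper, not a Phoenix function.
function percentile(values: number[], p: number): number {
  if (values.length === 0) {
    throw new Error('percentile requires at least one value')
  }
  if (p <= 0 || p > 100) {
    throw new Error('p must be in the range (0, 100]')
  }
  // Sort a copy so the caller's array is left untouched.
  const sorted = [...values].sort((a, b) => a - b)
  // Nearest rank: the smallest value with at least p% of samples at or below it.
  const rank = Math.ceil((p / 100) * sorted.length)
  return sorted[rank - 1]
}

// Illustrative sample of ten call latencies (ms).
const latenciesMs = [820, 950, 1100, 1234, 1300, 1500, 1800, 2100, 2600, 5200]
const p50 = percentile(latenciesMs, 50) // 1300
const p95 = percentile(latenciesMs, 95) // 5200
const p99 = percentile(latenciesMs, 99) // 5200
```

With only ten samples, P95 and P99 both land on the slowest call, which is why tail percentiles are more informative over larger windows.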
LLM Metrics
Tracked Metrics
Phoenix automatically tracks:
{
  // Model Information
  'llm.provider': 'anthropic',
  'llm.model': 'claude-sonnet-4-5-20250929',
  'llm.temperature': 0.3,
  'llm.max_tokens': 4096,

  // Usage Metrics
  'llm.input_tokens': 1250,
  'llm.output_tokens': 850,
  'llm.total_tokens': 2100,

  // Cost Metrics
  'llm.input_cost_usd': 0.00375,   // $3/M tokens
  'llm.output_cost_usd': 0.01275,  // $15/M tokens
  'llm.total_cost_usd': 0.0165,

  // Performance Metrics
  'llm.latency_ms': 1234,
  'llm.ttft_ms': 245,              // Time to first token
  'llm.tokens_per_second': 688,

  // Quality Metrics
  'llm.finish_reason': 'stop',
  'llm.stop_reason': null,
  'llm.error': null,

  // Context
  'agent.id': 'tdd-enforcer-001',
  'task.id': 'task-123',
  'project.id': 'llm-platform',
  'user.id': 'user-456'
}
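Several of the fields above are derived from others: the costs follow from token counts and per-million-token prices, and the throughput figure is consistent with output tokens divided by latency. The sketch below shows that arithmetic. The names `RawCall` and `deriveMetrics` are hypothetical, and the assumption that `llm.tokens_per_second` is output tokens per second of total latency, rounded down, is inferred from the sample values rather than from Phoenix documentation.

```typescript
// Raw measurements for one LLM call (hypothetical shape, for illustration).
interface RawCall {
  inputTokens: number
  outputTokens: number
  latencyMs: number
  inputPricePerMillion: number   // USD per million input tokens
  outputPricePerMillion: number  // USD per million output tokens
}

// Compute the derived usage, cost, and throughput fields from raw values.
function deriveMetrics(call: RawCall) {
  const inputCostUsd = (call.inputTokens / 1_000_000) * call.inputPricePerMillion
  const outputCostUsd = (call.outputTokens / 1_000_000) * call.outputPricePerMillion
  return {
    totalTokens: call.inputTokens + call.outputTokens,
    inputCostUsd,
    outputCostUsd,
    totalCostUsd: inputCostUsd + outputCostUsd,
    // Assumption: output tokens per second of wall-clock latency, rounded down.
    tokensPerSecond: Math.floor(call.outputTokens / (call.latencyMs / 1000))
  }
}

// Same inputs as the sample attributes above.
const derived = deriveMetrics({
  inputTokens: 1250,
  outputTokens: 850,
  latencyMs: 1234,
  inputPricePerMillion: 3,
  outputPricePerMillion: 15
})
// derived.totalTokens === 2100, derived.tokensPerSecond === 688,
// derived.totalCostUsd is approximately 0.0165
```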
Cost Tracking
Cost Calculation
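// Token counts for a single LLM call
interface TokenUsage {
  inputTokens: number
  outputTokens: number
}

// Cost per model (as of Nov 2024), in USD per million tokens
const MODEL_COSTS: Record<string, { input: number; output: number }> = {
  'claude-sonnet-4-5-20250929': {
    input: 3.00,   // $3.00 per million tokens
    output: 15.00  // $15.00 per million tokens
  },
  'claude-opus-4-20250514': {
    input: 15.00,
    output: 75.00
  },
  'gpt-4-turbo': {
    input: 10.00,
    output: 30.00
  }
}

function calculateCost(model: string, usage: TokenUsage): number {
  const costs = MODEL_COSTS[model]
  // Fail loudly instead of silently reporting NaN for an unpriced model
  if (!costs) {
    throw new Error(`No pricing configured for model: ${model}`)
  }
  const inputCost = (usage.inputTokens / 1_000_000) * costs.input
  const outputCost = (usage.outputTokens / 1_000_000) * costs.output
  return inputCost + outputCost
}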
Daily Cost Report
$ agent-tracer phoenix cost-report --days 1

LLM Cost Report - Last 24 Hours

Total Cost: $45.23

By Model:
  claude-sonnet-4-5-20250929: $32.10 (71%)
  claude-opus-4-20250514:     $10.45 (23%)
  gpt-4-turbo:                $2.68 (6%)

By Agent:
  tdd-enforcer-001:   $18.50 (41%)
  api-builder-002:    $12.30 (27%)
  doc-sync-003:       $8.20 (18%)
  security-audit-004: $6.23 (14%)

By Task Type:
  test-generation: $22.15 (49%)
  code-review:     $12.80 (28%)
  documentation:   $6.50 (14%)
  security-scan:   $3.78 (8%)

Token Usage:
  Input:  1.2M tokens
  Output: 850k tokens
  Total:  2.05M tokens

Recommendations:
  Consider caching common test patterns
  Use Sonnet instead of Opus for simple tasks
  Token efficiency improved 15% vs yesterday
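The "By Model" and "By Agent" breakdowns in a report like this amount to grouping per-call costs by a key and expressing each group as a share of the total. Here is a minimal sketch of that grouping. The `CallCost` shape and the `groupCosts` helper are assumptions for illustration and do not describe how `agent-tracer` builds its report.

```typescript
// One traced call with its cost (hypothetical record shape).
interface CallCost {
  model: string
  agentId: string
  costUsd: number
}

// Sum costs per key and attach each group's share of the total, largest first.
function groupCosts(calls: CallCost[], key: 'model' | 'agentId') {
  const totals: Record<string, number> = {}
  let grandTotal = 0
  for (const call of calls) {
    const name = call[key]
    totals[name] = (totals[name] ?? 0) + call.costUsd
    grandTotal += call.costUsd
  }
  return Object.keys(totals)
    .map((name) => ({
      name,
      costUsd: totals[name],
      // Whole-number percentage of the overall spend.
      percent: grandTotal > 0 ? Math.round((totals[name] / grandTotal) * 100) : 0
    }))
    .sort((a, b) => b.costUsd - a.costUsd)
}

// Illustrative data: three calls across two models and two agents.
const sampleCalls: CallCost[] = [
  { model: 'claude-sonnet-4-5-20250929', agentId: 'tdd-enforcer-001', costUsd: 3 },
  { model: 'claude-opus-4-20250514', agentId: 'tdd-enforcer-001', costUsd: 1 },
  { model: 'claude-sonnet-4-5-20250929', agentId: 'api-builder-002', costUsd: 1 }
]
const byModel = groupCosts(sampleCalls, 'model')   // sonnet 80%, opus 20%
const byAgent = groupCosts(sampleCalls, 'agentId') // tdd-enforcer-001 80%, api-builder-002 20%
```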
Prompt Optimization
Analyzing Prompts
Phoenix tracks prompt performance:
// Track prompt versions
const promptV1 = {
  template: 'Generate tests for: {code}',
  avgTokens: 2500,
  avgCost: 0.025,
  avgLatency: 1800,
  successRate: 0.85
}

const promptV2 = {
  template: 'As a senior test engineer, create comprehensive unit tests for the following TypeScript code:\n\n{code}\n\nInclude: edge cases, error handling, mocks',
  avgTokens: 3200,
  avgCost: 0.032,
  avgLatency: 2100,
  successRate: 0.97
}

// Compare in Phoenix UI
// Result: V2 costs 28% more per call but its success rate is 12 percentage points higher
// Decision: Use V2 for critical code, V1 for simple utilities
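The trade-off between two prompt versions is easier to judge when the deltas are computed explicitly: relative change for cost, and percentage points for success rate. The helper below is a small sketch of that comparison; `PromptStats` and `comparePrompts` are hypothetical names, not Phoenix APIs.

```typescript
// Aggregate statistics for one prompt version (hypothetical shape).
interface PromptStats {
  avgCost: number      // USD per call
  successRate: number  // fraction between 0 and 1
}

// Compare a candidate prompt against a baseline.
function comparePrompts(base: PromptStats, candidate: PromptStats) {
  return {
    // Relative change in average cost, as a percentage of the baseline cost.
    costChangePercent: Math.round(
      ((candidate.avgCost - base.avgCost) / base.avgCost) * 100
    ),
    // Absolute change in success rate, in percentage points.
    successChangePoints: Math.round(
      (candidate.successRate - base.successRate) * 100
    )
  }
}

const delta = comparePrompts(
  { avgCost: 0.025, successRate: 0.85 },  // V1 figures from above
  { avgCost: 0.032, successRate: 0.97 }   // V2 figures from above
)
// delta.costChangePercent === 28, delta.successChangePoints === 12
```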
Token Optimization
// BAD - Wasteful prompt
const badPrompt = `
You are an expert senior principal staff engineer with 20 years of experience
in test-driven development, clean code, SOLID principles, design patterns,
and software architecture. Your task is to carefully analyze the following
code and generate comprehensive, well-structured, maintainable unit tests...

[Long preamble continues...]

Here is the code to test:
${code}
`

// GOOD - Efficient prompt
const goodPrompt = `Generate comprehensive unit tests for:
${code}

Include: edge cases, error handling, mocks.
`

// Token savings: ~200 tokens per call
// Cost savings: $0.003 per call at $15/M input tokens (Opus); ~$0.0006 at $3/M (Sonnet)
// Daily savings (1000 calls): $3.00 at the Opus rate
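The savings from trimming a prompt depend on the input-token price of the model in use, so it helps to compute them per model. The following sketch assumes the per-million-token input prices from the cost table in this document; `estimateSavings` is a hypothetical helper.

```typescript
// Estimate the cost saved by removing tokens from every prompt.
function estimateSavings(
  tokensSavedPerCall: number,
  inputPricePerMillionUsd: number,
  callsPerDay: number
) {
  const perCallUsd = (tokensSavedPerCall / 1_000_000) * inputPricePerMillionUsd
  return { perCallUsd, perDayUsd: perCallUsd * callsPerDay }
}

// 200 tokens saved per call, 1000 calls per day.
const opusSavings = estimateSavings(200, 15, 1000)   // about $0.003 per call, $3.00 per day
const sonnetSavings = estimateSavings(200, 3, 1000)  // about $0.0006 per call, $0.60 per day
```

At Sonnet's input price the same trim saves one fifth as much, so prompt trimming pays off most on the more expensive models.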
Integration with Agent Tracer
Automatic Phoenix Integration
Agent Tracer automatically sends traces to Phoenix:
// No manual integration needed!
// Just configure endpoints in environment
const tracer = new AgentTracer({
  serviceName: 'tdd-enforcer',
  phoenixEndpoint: process.env.PHOENIX_COLLECTOR_ENDPOINT
})

// All LLM calls automatically traced to Phoenix
const result = await callClaude(prompt)
View Agent Traces in Phoenix
# Filter by agent
http://localhost:6006/traces?filter=agent_id:tdd-enforcer-001

# Filter by cost
http://localhost:6006/traces?filter=cost_usd:>0.10

# Filter by latency
http://localhost:6006/traces?filter=latency_ms:>2000

# Filter by error
http://localhost:6006/traces?filter=error:true
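When these filter URLs are generated from code rather than typed by hand, characters such as `:` and `>` should be percent-encoded so the query string survives proxies and shells. The sketch below builds such a URL with the standard `encodeURIComponent`. The helper `buildTraceFilterUrl` is hypothetical, and the `field:expression` filter syntax is taken from the examples above rather than from Phoenix documentation.

```typescript
// Build a trace filter URL with a safely encoded filter expression.
// The "field:expression" syntax mirrors the examples in this document.
function buildTraceFilterUrl(
  baseUrl: string,
  field: string,
  expression: string
): string {
  const filter = `${field}:${expression}`
  return `${baseUrl}/traces?filter=${encodeURIComponent(filter)}`
}

const byAgentUrl = buildTraceFilterUrl('http://localhost:6006', 'agent_id', 'tdd-enforcer-001')
// http://localhost:6006/traces?filter=agent_id%3Atdd-enforcer-001

const byCostUrl = buildTraceFilterUrl('http://localhost:6006', 'cost_usd', '>0.10')
// http://localhost:6006/traces?filter=cost_usd%3A%3E0.10
```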
Alerting
Cost Alerts
# phoenix-alerts.yml
alerts:
  - name: high_daily_cost
    condition: daily_cost_usd > 100
    notification:
      slack: '#agent-alerts'
      email: 'team@example.com'

  - name: high_single_call_cost
    condition: call_cost_usd > 1.00
    notification:
      slack: '#agent-alerts'
Performance Alerts
alerts:
  - name: high_latency
    condition: p95_latency_ms > 5000
    notification:
      slack: '#agent-alerts'

  - name: high_error_rate
    condition: error_rate > 0.05
    notification:
      pagerduty: true
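Each alert condition above has the form `metric operator threshold`. To make the semantics concrete, here is a minimal sketch of how such a condition string could be parsed and evaluated against a snapshot of current metric values. The functions `parseCondition` and `isTriggered` are illustrative assumptions and do not describe the actual alert engine.

```typescript
type Operator = '>' | '>=' | '<' | '<='

interface ParsedCondition {
  metric: string
  operator: Operator
  threshold: number
}

// Parse a condition such as "p95_latency_ms > 5000".
function parseCondition(condition: string): ParsedCondition {
  const match = /^\s*(\w+)\s*(>=|<=|>|<)\s*(-?\d+(?:\.\d+)?)\s*$/.exec(condition)
  if (!match) {
    throw new Error(`Unsupported condition: ${condition}`)
  }
  return {
    metric: match[1],
    operator: match[2] as Operator,
    threshold: Number(match[3])
  }
}

// Evaluate a condition against the latest metric values.
function isTriggered(condition: string, metrics: Record<string, number>): boolean {
  const { metric, operator, threshold } = parseCondition(condition)
  const value = metrics[metric]
  // A metric that has not been reported yet cannot trigger an alert.
  if (value === undefined) return false
  if (operator === '>') return value > threshold
  if (operator === '>=') return value >= threshold
  if (operator === '<') return value < threshold
  return value <= threshold
}

const latencyAlert = isTriggered('p95_latency_ms > 5000', { p95_latency_ms: 6200 }) // true
const errorAlert = isTriggered('error_rate > 0.05', { error_rate: 0.01 })           // false
```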
Best Practices
1. Tag All Traces
phoenix.traceLLMCall({
  model: 'claude-sonnet-4-5-20250929',
  prompt: prompt,
  metadata: {
    // Always include
    agentId: 'tdd-enforcer-001',
    taskId: 'task-123',
    projectId: 'llm-platform',

    // Optional but helpful
    userId: 'user-456',
    feature: 'test-generation',
    environment: 'production',
    version: 'v1.2.3'
  }
})
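One way to enforce the "always include" tags is to check metadata before a call is traced. The sketch below reports which required tags are missing. `REQUIRED_TAGS` and `missingTags` are hypothetical names, and the required set is taken from the example above.

```typescript
// Tags every trace should carry, per the example above.
const REQUIRED_TAGS = ['agentId', 'taskId', 'projectId'] as const

// Return the names of required tags that are absent or empty.
function missingTags(metadata: Record<string, string | undefined>): string[] {
  return REQUIRED_TAGS.filter((tag) => !metadata[tag])
}

const complete = missingTags({
  agentId: 'tdd-enforcer-001',
  taskId: 'task-123',
  projectId: 'llm-platform'
}) // []

const incomplete = missingTags({
  agentId: 'tdd-enforcer-001',
  taskId: ''
}) // ['taskId', 'projectId']
```

A caller could log or throw when the returned list is non-empty, depending on how strict the tagging policy should be.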
2. Monitor Cost Trends
# Weekly cost review
agent-tracer phoenix cost-report --days 7 --trend

# Set budget alerts
agent-tracer phoenix alert create \
  --type cost \
  --threshold 500 \
  --period daily
3. Optimize High-Cost Tasks
# Find expensive tasks
agent-tracer phoenix analyze --sort-by cost --limit 10

# Result:
# 1. Code Review (complex): $2.50 avg
# 2. Architecture Analysis: $1.80 avg
# 3. Test Generation (full): $0.95 avg

# Optimization: Cache common patterns, use smaller model for simple tasks
4. A/B Test Prompts
// Test prompt versions
const results = await phoenix.abTest({
  variants: [
    { id: 'v1', prompt: promptV1 },
    { id: 'v2', prompt: promptV2 }
  ],
  traffic: { v1: 0.5, v2: 0.5 },
  duration: '7d',
  metrics: ['cost', 'latency', 'success_rate']
})
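A traffic split such as `{ v1: 0.5, v2: 0.5 }` is commonly implemented by hashing a stable key (for example a task ID) to a point in [0, 1) and walking the cumulative weights, so the same key always receives the same variant. The sketch below illustrates that technique with an FNV-1a-style hash over UTF-16 code units. `assignVariant` is a hypothetical helper and is not how `phoenix.abTest` is known to work internally.

```typescript
// Deterministically map a key to a variant according to traffic weights.
// Weights are expected to sum to 1.
function assignVariant(key: string, traffic: Record<string, number>): string {
  // FNV-1a-style 32-bit hash over the key's UTF-16 code units.
  let hash = 0x811c9dc5
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0
  }
  // Scale the unsigned 32-bit hash to a point in [0, 1).
  const point = hash / 0x100000000

  // Walk the cumulative weights until the point falls inside a bucket.
  const ids = Object.keys(traffic)
  let cumulative = 0
  for (const id of ids) {
    cumulative += traffic[id]
    if (point < cumulative) return id
  }
  // Guard against rounding when the weights sum to slightly less than 1.
  return ids[ids.length - 1]
}

const variant = assignVariant('task-123', { v1: 0.5, v2: 0.5 })
const variantAgain = assignVariant('task-123', { v1: 0.5, v2: 0.5 })
// variant and variantAgain are always the same value, either 'v1' or 'v2'
```

Hashing rather than calling a random number generator keeps assignments stable across retries, which avoids mixing both prompts into the metrics for a single task.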
Troubleshooting
Phoenix Not Receiving Traces
# Check Phoenix is running
curl http://localhost:6006/health

# Check OTLP endpoint (4317 speaks gRPC, so curl only confirms the port is reachable)
curl http://localhost:4317

# Verify environment variables
echo $PHOENIX_COLLECTOR_ENDPOINT

# Check agent tracer logs
docker logs agent-tracer | grep phoenix
High Memory Usage
# Limit Phoenix data retention
docker run -e PHOENIX_DATA_RETENTION_DAYS=7 arizephoenix/phoenix

# Or in docker-compose
environment:
  - PHOENIX_DATA_RETENTION_DAYS=7
  - PHOENIX_MAX_TRACES=100000