Phoenix Runbook

Overview

  • Purpose: LLM observability platform for tracing, evaluating, and debugging AI/ML workloads. Provides visibility into LLM calls, token usage, latency, and response quality across the agent platform.
  • Port: 6006
  • Health endpoint: GET /health or GET /api/health
  • Namespace: observability (Kubernetes)
  • Technology: Arize Phoenix

Dependencies

  • PostgreSQL (port 5432) - Optional, for persistent trace storage
  • Agent Brain (port 3001) - Primary trace source
  • Agent Router (port 3002) - LLM call tracing

Key Features

Feature         | Description
----------------|--------------------------------------------
Trace Viewer    | Visualize LLM call chains and latencies
Token Analytics | Track token usage by model, agent, workflow
Evaluation      | Run evals on LLM outputs for quality
Embeddings      | Visualize embedding distributions
Datasets        | Manage evaluation datasets

Common Issues

Issue 1: Traces Not Appearing

  • Symptoms:
    • Empty trace viewer
    • No new traces being recorded
    • Agents working but no visibility
  • Cause:
    • OpenTelemetry exporter not configured
    • Phoenix collector not receiving data
    • Instrumentation library missing
  • Resolution:
    # Check Phoenix is receiving data
    curl http://localhost:6006/api/spans/count

    # Verify OTEL endpoint configuration in agent-brain
    kubectl get deployment agent-brain -n agents -o yaml | grep -A 5 OTEL

    # Check if traces are being sent
    kubectl logs deployment/agent-brain -n agents | grep -i "opentelemetry\|trace\|span"

    # Test direct span submission
    curl -X POST http://localhost:6006/v1/traces \
      -H "Content-Type: application/json" \
      -d '{"resourceSpans": []}'

    # Verify Phoenix collector is running
    kubectl logs deployment/phoenix -n observability | grep -i collector

    # Restart agents to re-establish connection
    kubectl rollout restart deployment/agent-brain -n agents
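
The direct-submission test above posts an empty resourceSpans array, which exercises the endpoint but leaves nothing to look for in the trace viewer. The following is a minimal sketch of a payload with one span in OTLP/JSON encoding. The service name, scope name, and span name are hypothetical, and whether the collector accepts JSON rather than protobuf on /v1/traces is an assumption carried over from the curl command above.

```python
import json
import secrets
import time


def build_test_payload(service_name="runbook-smoke-test", span_name="smoke-test-span"):
    """Return an OTLP/JSON trace payload containing a single short span."""
    end_ns = time.time_ns()
    start_ns = end_ns - 1_000_000  # pretend the span lasted 1 ms
    span = {
        "traceId": secrets.token_hex(16),  # 16 bytes -> 32 hex characters
        "spanId": secrets.token_hex(8),    # 8 bytes -> 16 hex characters
        "name": span_name,
        "kind": 1,  # SPAN_KIND_INTERNAL
        "startTimeUnixNano": str(start_ns),  # 64-bit integers travel as JSON strings
        "endTimeUnixNano": str(end_ns),
    }
    resource = {
        "attributes": [
            {"key": "service.name", "value": {"stringValue": service_name}}
        ]
    }
    return {
        "resourceSpans": [
            {
                "resource": resource,
                # "runbook" is a hypothetical instrumentation scope name.
                "scopeSpans": [{"scope": {"name": "runbook"}, "spans": [span]}],
            }
        ]
    }


if __name__ == "__main__":
    # Print the payload so it can be piped into curl with -d @-
    print(json.dumps(build_test_payload()))
```

If the span is accepted, searching the trace viewer for the hypothetical service name gives a quick end-to-end check that is independent of agent instrumentation.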

Issue 2: High Memory Usage

  • Symptoms:
    • Phoenix consuming excessive memory
    • OOMKilled events
    • Slow UI response
  • Cause:
    • Too many traces stored in memory
    • Large span payloads
    • No retention policy configured
  • Resolution:
    # Check memory usage
    kubectl top pods -n observability -l app=phoenix

    # Check trace count (URL quoted so the shell does not interpret "?")
    curl -s "http://localhost:6006/api/projects/default/spans?limit=1" | jq '.total'

    # Enable persistence to reduce memory (if not already enabled)
    kubectl set env deployment/phoenix -n observability \
      PHOENIX_SQL_DATABASE_URL=postgresql://postgres:password@postgresql:5432/phoenix

    # Set retention policy
    curl -X PUT http://localhost:6006/api/config \
      -H "Content-Type: application/json" \
      -d '{"retention_days": 7}'

    # Force garbage collection
    curl -X POST http://localhost:6006/api/gc

    # Increase memory limits
    kubectl set resources deployment/phoenix -n observability \
      --limits=memory=8Gi --requests=memory=2Gi

Issue 3: Trace Latency / Slow UI

  • Symptoms:
    • Dashboard loading slowly
    • Trace queries timing out
    • High CPU usage
  • Cause:
    • Large trace volume
    • Complex queries
    • No indexing on trace data
  • Resolution:
    # Check trace volume
    curl http://localhost:6006/api/stats

    # Reduce time range in queries
    # (Use UI filters to limit data)

    # Enable sampling if high volume
    kubectl set env deployment/phoenix -n observability \
      PHOENIX_SAMPLE_RATE=0.1

    # Optimize database (if using PostgreSQL)
    psql -h postgresql -U postgres -d phoenix -c "ANALYZE;"

    # Add indexes on common query patterns
    psql -h postgresql -U postgres -d phoenix -c "CREATE INDEX IF NOT EXISTS idx_spans_start_time ON spans(start_time);"
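
To make the effect of a 0.1 sample rate concrete, here is a minimal sketch of deterministic sampling keyed on the trace ID, similar in spirit to OpenTelemetry's trace-ID ratio sampler. It is an illustration only: how Phoenix itself applies PHOENIX_SAMPLE_RATE is not specified in this runbook, and the function name is hypothetical.

```python
def should_sample(trace_id_hex: str, sample_rate: float) -> bool:
    """Decide deterministically whether to keep a trace, based on its ID."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    # Treat the low 64 bits of the trace ID as a uniformly distributed number.
    low_bits = int(trace_id_hex, 16) & 0xFFFFFFFFFFFFFFFF
    # Keep the trace when that number falls below rate * 2^64.
    threshold = round(sample_rate * (1 << 64))
    return low_bits < threshold
```

Because the decision depends only on the trace ID, every span of a given trace gets the same verdict, so sampled traces stay complete rather than losing random spans.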

Issue 4: Embedding Visualization Broken

  • Symptoms:
    • Embedding projections not loading
    • "UMAP failed" errors
    • Blank embedding explorer
  • Cause:
    • Not enough embeddings for projection
    • Memory insufficient for UMAP
    • Embedding dimension mismatch
  • Resolution:
    # Check embedding count
    curl http://localhost:6006/api/embeddings/count

    # Verify embedding dimensions are consistent
    curl http://localhost:6006/api/embeddings/dimensions

    # Force re-compute projections
    curl -X POST http://localhost:6006/api/embeddings/reproject

    # Reduce sample size for projections
    kubectl set env deployment/phoenix -n observability \
      PHOENIX_EMBEDDING_SAMPLE_SIZE=1000

Issue 5: Evaluations Failing

  • Symptoms:
    • Eval runs not completing
    • "Evaluation error" in logs
    • Metrics not being computed
  • Cause:
    • LLM-as-judge API failing
    • Eval dataset malformed
    • Rate limiting on eval model
  • Resolution:
    # Check eval job status
    curl http://localhost:6006/api/evaluations/status

    # View eval errors
    curl http://localhost:6006/api/evaluations/{eval_id}/errors

    # Retry failed evaluation
    curl -X POST http://localhost:6006/api/evaluations/{eval_id}/retry

    # Check LLM availability for evals
    curl http://localhost:3002/api/v1/providers/health

    # Reduce eval batch size
    kubectl set env deployment/phoenix -n observability \
      PHOENIX_EVAL_BATCH_SIZE=5
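
When the cause is rate limiting on the eval model, retrying immediately tends to fail again. The sketch below shows one common mitigation, exponential backoff with jitter, wrapped around any callable that raises on failure. The function name and default values are illustrative assumptions, not Phoenix settings.

```python
import random
import time


def retry_with_backoff(call, max_attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call `call()` until it succeeds, doubling the wait after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter spreads retries out so batched evals do not retry in lockstep.
            sleep(delay * random.uniform(0.5, 1.0))
```

The `sleep` parameter exists so the helper can be exercised without real waiting; in normal use it can be left at its default.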

Issue 6: Data Loss After Restart

  • Symptoms:
    • Traces gone after pod restart
    • Historical data missing
    • Dashboard empty
  • Cause:
    • In-memory storage only
    • PVC not configured
    • Database connection lost
  • Resolution:
    # Check storage configuration
    kubectl get deployment phoenix -n observability -o yaml | grep -A 10 volumes

    # Enable persistent storage
    kubectl set env deployment/phoenix -n observability \
      PHOENIX_WORKING_DIR=/phoenix-data \
      PHOENIX_SQL_DATABASE_URL=postgresql://postgres:password@postgresql:5432/phoenix

    # Add PVC for local storage
    kubectl apply -f - <<EOF
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: phoenix-data
      namespace: observability
    spec:
      accessModes: [ReadWriteOnce]
      resources:
        requests:
          storage: 50Gi
    EOF

    # Mount PVC to deployment
    kubectl patch deployment phoenix -n observability -p '
    {
      "spec": {
        "template": {
          "spec": {
            "volumes": [{"name": "data", "persistentVolumeClaim": {"claimName": "phoenix-data"}}],
            "containers": [{"name": "phoenix", "volumeMounts": [{"name": "data", "mountPath": "/phoenix-data"}]}]
          }
        }
      }
    }'
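
Grepping the YAML for "volumes" shows that a volume exists but not whether it is PVC-backed and actually mounted. A minimal sketch of that check follows, operating on the parsed output of `kubectl get deployment phoenix -n observability -o json`. The function name is hypothetical.

```python
def pvc_mounts(deployment: dict) -> dict:
    """Map each PVC claim name to the container paths where it is mounted."""
    pod_spec = deployment["spec"]["template"]["spec"]
    # Volume name -> claim name, for PVC-backed volumes only.
    claims = {
        v["name"]: v["persistentVolumeClaim"]["claimName"]
        for v in pod_spec.get("volumes", [])
        if "persistentVolumeClaim" in v
    }
    mounts = {claim: [] for claim in claims.values()}
    for container in pod_spec.get("containers", []):
        for mount in container.get("volumeMounts", []):
            if mount["name"] in claims:
                mounts[claims[mount["name"]]].append(mount["mountPath"])
    return mounts
```

An empty result means no PVC is attached; a claim with an empty path list means the volume is declared but not mounted into any container. After the patch above, the expected result would be the phoenix-data claim mounted at /phoenix-data.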

Restart Procedure

# 1. Check for active evaluations
curl http://localhost:6006/api/evaluations/active

# 2. Wait for evaluations to complete or pause
curl -X POST http://localhost:6006/api/evaluations/pause

# 3. Flush in-memory data to disk (if persistent storage)
curl -X POST http://localhost:6006/api/flush

# 4. Perform rolling restart
kubectl rollout restart deployment/phoenix -n observability

# 5. Wait for ready
kubectl wait --for=condition=ready pod -l app=phoenix -n observability --timeout=120s

# 6. Verify health
curl http://localhost:6006/health

# 7. Resume evaluations
curl -X POST http://localhost:6006/api/evaluations/resume
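
A single curl against the health endpoint can race the pod becoming ready. The following sketch polls the endpoint until it answers with a 2xx status; the defaults of 24 attempts at 5-second intervals are an assumption chosen to match the 120s kubectl wait timeout, and both function names are hypothetical.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if a GET to `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and 4xx/5xx responses all count as unhealthy.
        return False


def wait_until_healthy(url, attempts=24, interval=5.0, probe=http_ok, sleep=time.sleep):
    """Poll `url` until the probe succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(interval)
    return False
```

For example, `wait_until_healthy("http://localhost:6006/health")` could gate step 7 so evaluations are resumed only once Phoenix responds.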

Emergency Restart

# Force restart
kubectl delete pod -l app=phoenix -n observability --force

# Wait for recovery
kubectl wait --for=condition=ready pod -l app=phoenix -n observability --timeout=120s

# Verify UI accessible
curl -I http://localhost:6006

Local Development Restart

# Docker
docker restart phoenix

# OrbStack
orb restart phoenix

# Using pip install
# (pkill exits non-zero when no process matches, so the commands are not chained with &&)
pkill -f "phoenix"; python -m phoenix.server.main serve

Logs Location

Kubernetes Logs

# Phoenix logs
kubectl logs -f deployment/phoenix -n observability

# Filter for errors
kubectl logs deployment/phoenix -n observability | grep -E "ERROR|WARN|Exception"

# Export logs
kubectl logs deployment/phoenix -n observability > phoenix-logs-$(date +%Y%m%d).txt
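
For a large exported log file, a count per severity is often more useful than the raw grep output. This is a minimal sketch that applies the same pattern as the grep filter above and tallies matches; the function name is hypothetical and the log line format is not assumed beyond containing one of the keywords.

```python
import re
from collections import Counter

# Same pattern as the grep filter above (case-sensitive).
PATTERN = re.compile(r"ERROR|WARN|Exception")


def summarize(lines):
    """Count log lines by the first matching keyword on each line."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group(0)] += 1
    return counts
```

It can be fed an open file directly, for example `summarize(open("phoenix-logs-20240101.txt"))`, where the file name stands in for whatever the export command produced. Lines containing "WARNING" are counted under "WARN".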

Application Logs

# Log level adjustment
kubectl set env deployment/phoenix -n observability LOG_LEVEL=DEBUG

# View collector logs specifically
kubectl logs deployment/phoenix -n observability | grep -i collector

Trace Debugging

# Export traces as JSON
curl "http://localhost:6006/api/projects/default/spans?limit=100" > traces.json

# Search for specific trace
curl "http://localhost:6006/api/traces/{trace_id}"

Scaling

Vertical Scaling

# Increase resources for larger trace volumes
kubectl set resources deployment/phoenix -n observability \
  --limits=cpu=4000m,memory=16Gi \
  --requests=cpu=1000m,memory=4Gi

Storage Scaling

# Expand PVC for more trace storage
kubectl patch pvc phoenix-data -n observability \
  -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'

Read Replica (for high query load)

# Deploy read-only Phoenix instance
kubectl apply -f phoenix-readonly-deployment.yaml

# Configure load balancing
kubectl apply -f phoenix-service-lb.yaml

Scaling Guidelines

Metric               | Threshold  | Action
---------------------|------------|------------------------------------
Memory Usage         | > 80%      | Increase memory, enable persistence
CPU Usage            | > 70%      | Increase CPU
Trace Ingestion Rate | > 1000/sec | Add sampling
Disk Usage           | > 80%      | Expand storage, reduce retention
Query Latency P99    | > 5s       | Add indexes, reduce time range
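
The guidelines above can be encoded as data so that a script can turn a metrics snapshot into a list of actions. A minimal sketch follows; the metric key names are hypothetical, and the thresholds and actions are copied from the table.

```python
# (metric key, threshold, action) rows copied from the scaling guidelines table.
GUIDELINES = [
    ("memory_pct", 80, "Increase memory, enable persistence"),
    ("cpu_pct", 70, "Increase CPU"),
    ("traces_per_sec", 1000, "Add sampling"),
    ("disk_pct", 80, "Expand storage, reduce retention"),
    ("query_p99_sec", 5, "Add indexes, reduce time range"),
]


def recommended_actions(metrics: dict) -> list:
    """Return the action for every metric that exceeds its threshold."""
    return [
        action
        for name, threshold, action in GUIDELINES
        if metrics.get(name, 0) > threshold  # missing metrics never trigger
    ]
```

For instance, a snapshot with memory at 85% and P99 query latency at 7.2s yields the memory and indexing actions, while CPU at 40% yields nothing.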

Alerts

Critical Alerts (PagerDuty)

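Alert       | Condition                     | Runbook Action
------------|-------------------------------|----------------------------------
PhoenixDown | Cannot connect for 2 min      | Emergency Restart
TraceLoss   | 0 traces ingested for 10 min  | Check agent instrumentation
StorageFull | Disk > 95%                    | Reduce retention, expand storage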

Warning Alerts (Slack)

Alert           | Condition           | Runbook Action
----------------|---------------------|------------------------------------
HighMemory      | Memory > 80%        | Enable persistence, increase limit
SlowQueries     | Query latency > 10s | Add indexes, optimize
EvalFailures    | > 50% eval failures | Check LLM provider
HighTraceVolume | > 10K traces/min    | Enable sampling

Prometheus Alert Rules

groups:
  - name: phoenix
    rules:
      - alert: PhoenixDown
        expr: up{job="phoenix"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Phoenix is down"
          runbook_url: "https://gitlab.com/blueflyio/agent-platform/technical-docs/-/wikis/runbooks/phoenix"

      - alert: PhoenixHighMemory
        expr: phoenix_memory_usage_bytes / phoenix_memory_limit_bytes > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Phoenix memory usage high"

      - alert: NoTracesIngested
        expr: rate(phoenix_traces_ingested_total[10m]) == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "No traces being ingested"

Monitoring Dashboards

  • Phoenix UI: http://localhost:6006
  • Grafana: https://grafana.local/d/phoenix
  • Prometheus: https://prometheus.local/graph?g0.expr=up{job="phoenix"}
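
The Prometheus link above runs the `up{job="phoenix"}` query in the browser. The same check can be scripted against the Prometheus HTTP API instant-query endpoint (`/api/v1/query`), as in this minimal sketch. The base URL is taken from the dashboard list; the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def targets_down(response: dict) -> list:
    """Return the instance labels whose `up` sample is 0 in a query response."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query failed")
    return [
        sample["metric"].get("instance", "<unknown>")
        for sample in response["data"]["result"]
        if float(sample["value"][1]) == 0.0  # value is [timestamp, "string"]
    ]


def query_up(base_url="https://prometheus.local", job="phoenix"):
    """Run an instant query for up{job=...} and return the decoded JSON."""
    params = urllib.parse.urlencode({"query": f'up{{job="{job}"}}'})
    with urllib.request.urlopen(f"{base_url}/api/v1/query?{params}", timeout=5) as resp:
        return json.load(resp)
```

Calling `targets_down(query_up())` returns an empty list when every Phoenix target is up. An empty list also results when the job has no targets at all, so it is worth confirming that the result set is non-empty before treating it as healthy.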

API Reference

Traces

# Get traces
curl "http://localhost:6006/api/projects/default/spans"

# Get specific trace
curl "http://localhost:6006/api/traces/{trace_id}"

# Delete old traces
curl -X DELETE "http://localhost:6006/api/traces?before=2024-01-01"
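
The delete call above uses a hard-coded date. A minimal sketch of deriving the `before` value from a retention window, so the same command can run on a schedule, is shown here. The helper names are hypothetical, and the endpoint and its `before` parameter are taken as-is from the example above.

```python
from datetime import date, timedelta


def cutoff_date(retention_days: int, today: date = None) -> str:
    """Return the ISO date before which traces fall outside the retention window."""
    today = today or date.today()
    return (today - timedelta(days=retention_days)).isoformat()


def delete_url(retention_days: int, base_url="http://localhost:6006", today: date = None) -> str:
    """Build the DELETE URL for traces older than the retention window."""
    return f"{base_url}/api/traces?before={cutoff_date(retention_days, today)}"
```

With a 7-day retention and a run date of 2024-01-08, the helper produces the same `before=2024-01-01` value used in the example.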

Evaluations

# List evaluations
curl http://localhost:6006/api/evaluations

# Create evaluation
curl -X POST http://localhost:6006/api/evaluations \
  -H "Content-Type: application/json" \
  -d '{
    "name": "response-quality",
    "dataset_id": "...",
    "metrics": ["relevance", "fluency"]
  }'

# Get evaluation results
curl http://localhost:6006/api/evaluations/{eval_id}/results

Datasets

# List datasets
curl http://localhost:6006/api/datasets

# Upload dataset
curl -X POST http://localhost:6006/api/datasets \
  -H "Content-Type: application/json" \
  -d '{"name": "test-set", "data": [...]}'

Integration

OpenTelemetry Setup

# Python instrumentation
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

tracer_provider = register(endpoint="http://localhost:6006/v1/traces")
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

Environment Variables

PHOENIX_COLLECTOR_ENDPOINT=http://phoenix:6006/v1/traces
PHOENIX_PROJECT_NAME=agent-platform
PHOENIX_SQL_DATABASE_URL=postgresql://postgres:password@postgresql:5432/phoenix
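
Misconfigured variables are a frequent root cause of Issue 1 and Issue 6, so it can help to validate them at agent startup. The sketch below reads the three variables above and reports obviously broken values. The function name, the "default" project fallback, and the specific checks are assumptions for illustration, not behavior of Phoenix itself.

```python
import os
from urllib.parse import urlsplit


def load_phoenix_config(env=os.environ):
    """Read the Phoenix variables and return (config, list of problems)."""
    endpoint = env.get("PHOENIX_COLLECTOR_ENDPOINT", "")
    project = env.get("PHOENIX_PROJECT_NAME", "default")  # assumed fallback
    db_url = env.get("PHOENIX_SQL_DATABASE_URL", "")

    problems = []
    if urlsplit(endpoint).scheme not in ("http", "https"):
        problems.append("PHOENIX_COLLECTOR_ENDPOINT must be an http(s) URL")
    if db_url:
        parts = urlsplit(db_url)
        if not parts.scheme.startswith("postgresql") or not parts.hostname:
            problems.append("PHOENIX_SQL_DATABASE_URL is not a PostgreSQL URL")
    else:
        problems.append("PHOENIX_SQL_DATABASE_URL unset: PostgreSQL persistence is disabled")
    config = {"endpoint": endpoint, "project": project, "db_url": db_url}
    return config, problems
```

Passing a plain dict instead of `os.environ` makes the check easy to run in CI against a rendered deployment manifest.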

Contacts

  • On-call: PagerDuty rotation
  • Slack: #platform-incidents
  • Owner: AI/ML Team