Phoenix Runbook
Overview
- Purpose: LLM observability platform for tracing, evaluating, and debugging AI/ML workloads. Provides visibility into LLM calls, token usage, latency, and response quality across the agent platform.
- Port: 6006
- Health endpoint: GET /health or GET /api/health
- Namespace: observability (Kubernetes)
- Technology: Arize Phoenix
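A minimal sketch of a health probe that tries both paths listed above and reports the first one that answers. `PHOENIX_URL` and `phoenix_health` are hypothetical names introduced for illustration; the default base URL is taken from the port in this runbook.

```shell
# Probe the two health paths from the overview; succeed on the first that answers.
# PHOENIX_URL is a hypothetical override, defaulting to the port in this runbook.
phoenix_health() {
  local base="${PHOENIX_URL:-http://localhost:6006}"
  local path
  for path in /health /api/health; do
    # -f: treat HTTP errors as failures, -s: silent, --max-time: do not hang
    if curl -fs --max-time 5 -o /dev/null "${base}${path}"; then
      echo "healthy via ${path}"
      return 0
    fi
  done
  echo "unhealthy: no health path responded at ${base}" >&2
  return 1
}
```

Run `phoenix_health` after a port-forward or from inside the cluster; a non-zero exit status makes it usable in scripts and readiness checks.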
Dependencies
- PostgreSQL (port 5432) - Optional, for persistent trace storage
- Agent Brain (port 3001) - Primary trace source
- Agent Router (port 3002) - LLM call tracing
Key Features
| Feature | Description |
|---|---|
| Trace Viewer | Visualize LLM call chains and latencies |
| Token Analytics | Track token usage by model, agent, workflow |
| Evaluation | Run evals on LLM outputs for quality |
| Embeddings | Visualize embedding distributions |
| Datasets | Manage evaluation datasets |
Common Issues
Issue 1: Traces Not Appearing
- Symptoms:
- Empty trace viewer
- No new traces being recorded
- Agents working but no visibility
- Cause:
- OpenTelemetry exporter not configured
- Phoenix collector not receiving data
- Instrumentation library missing
- Resolution:
# Check Phoenix is receiving data
curl http://localhost:6006/api/spans/count
# Verify OTEL endpoint configuration in agent-brain
kubectl get deployment agent-brain -n agents -o yaml | grep -A 5 OTEL
# Check if traces are being sent
kubectl logs deployment/agent-brain -n agents | grep -i "opentelemetry\|trace\|span"
# Test direct span submission
curl -X POST http://localhost:6006/v1/traces \
-H "Content-Type: application/json" \
-d '{"resourceSpans": []}'
# Verify Phoenix collector is running
kubectl logs deployment/phoenix -n observability | grep -i collector
# Restart agents to re-establish connection
kubectl rollout restart deployment/agent-brain -n agents
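A misconfigured exporter endpoint is often a wrong port or a missing `/v1/traces` path. The sketch below checks a value against the endpoint shape used elsewhere in this runbook (`http://phoenix:6006/v1/traces`). The function name is hypothetical, and the check is only a string match, not a connectivity test.

```shell
# Return 0 when the value looks like the Phoenix OTLP/HTTP traces endpoint
# used in this runbook (port 6006, path /v1/traces). String check only.
check_collector_endpoint() {
  local endpoint="$1"
  case "$endpoint" in
    "")
      echo "missing: endpoint is empty"
      return 1 ;;
    http://*:6006/v1/traces|https://*:6006/v1/traces)
      echo "ok: $endpoint"
      return 0 ;;
    *)
      echo "suspect: $endpoint (expected http(s)://<host>:6006/v1/traces)"
      return 1 ;;
  esac
}
```

Pass it whatever value the `grep -A 5 OTEL` step above prints for the exporter endpoint; which variable agent-brain actually reads is deployment-specific and not stated in this runbook.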
Issue 2: High Memory Usage
- Symptoms:
- Phoenix consuming excessive memory
- OOMKilled events
- Slow UI response
- Cause:
- Too many traces stored in memory
- Large span payloads
- No retention policy configured
- Resolution:
# Check memory usage
kubectl top pods -n observability -l app=phoenix
# Check trace count
curl -s "http://localhost:6006/api/projects/default/spans?limit=1" | jq '.total'
# Enable persistence to reduce memory (if not already enabled)
kubectl set env deployment/phoenix -n observability \
PHOENIX_SQL_DATABASE_URL=postgresql://postgres:password@postgresql:5432/phoenix
# Set retention policy
curl -X PUT http://localhost:6006/api/config \
-H "Content-Type: application/json" \
-d '{"retention_days": 7}'
# Force garbage collection
curl -X POST http://localhost:6006/api/gc
# Increase memory limits
kubectl set resources deployment/phoenix -n observability \
--limits=memory=8Gi --requests=memory=2Gi
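To decide whether the limit needs raising, it helps to express current usage as a percentage of the limit. The sketch below converts Kubernetes memory quantities to MiB and flags usage above 80%, the threshold used in the scaling guidelines of this runbook. Function names are hypothetical, and only the `Ki`, `Mi`, and `Gi` suffixes are handled.

```shell
# Convert a Kubernetes memory quantity (Ki, Mi, Gi suffixes only) to whole MiB.
to_mib() {
  local q="$1"
  case "$q" in
    *Gi) echo $(( ${q%Gi} * 1024 )) ;;
    *Mi) echo "${q%Mi}" ;;
    *Ki) echo $(( ${q%Ki} / 1024 )) ;;
    *) echo "unsupported quantity: $q" >&2; return 1 ;;
  esac
}

# Print usage as an integer percentage of the limit and flag the 80% threshold.
memory_pressure() {
  local used limit pct
  used="$(to_mib "$1")" || return 1
  limit="$(to_mib "$2")" || return 1
  pct=$(( used * 100 / limit ))
  if [ "$pct" -gt 80 ]; then
    echo "${pct}% HIGH"
  else
    echo "${pct}% ok"
  fi
}
```

For example, `memory_pressure 7000Mi 8Gi` prints `85% HIGH`, using the memory column from `kubectl top pods` as the first argument and the configured limit as the second.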
Issue 3: Trace Latency / Slow UI
- Symptoms:
- Dashboard loading slowly
- Trace queries timing out
- High CPU usage
- Cause:
- Large trace volume
- Complex queries
- No indexing on trace data
- Resolution:
# Check trace volume
curl http://localhost:6006/api/stats
# Reduce time range in queries
# (Use UI filters to limit data)
# Enable sampling if high volume
kubectl set env deployment/phoenix -n observability \
PHOENIX_SAMPLE_RATE=0.1
# Optimize database (if using PostgreSQL)
psql -h postgresql -U postgres -d phoenix -c "ANALYZE;"
# Add indexes on common query patterns
psql -h postgresql -U postgres -d phoenix -c "CREATE INDEX IF NOT EXISTS idx_spans_start_time ON spans(start_time);"
Issue 4: Embedding Visualization Broken
- Symptoms:
- Embedding projections not loading
- "UMAP failed" errors
- Blank embedding explorer
- Cause:
- Not enough embeddings for projection
- Memory insufficient for UMAP
- Embedding dimension mismatch
- Resolution:
# Check embedding count
curl http://localhost:6006/api/embeddings/count
# Verify embedding dimensions are consistent
curl http://localhost:6006/api/embeddings/dimensions
# Force re-compute projections
curl -X POST http://localhost:6006/api/embeddings/reproject
# Reduce sample size for projections
kubectl set env deployment/phoenix -n observability \
PHOENIX_EMBEDDING_SAMPLE_SIZE=1000
Issue 5: Evaluations Failing
- Symptoms:
- Eval runs not completing
- "Evaluation error" in logs
- Metrics not being computed
- Cause:
- LLM-as-judge API failing
- Eval dataset malformed
- Rate limiting on eval model
- Resolution:
# Check eval job status
curl http://localhost:6006/api/evaluations/status
# View eval errors
curl http://localhost:6006/api/evaluations/{eval_id}/errors
# Retry failed evaluation
curl -X POST http://localhost:6006/api/evaluations/{eval_id}/retry
# Check LLM availability for evals
curl http://localhost:3002/api/v1/providers/health
# Reduce eval batch size
kubectl set env deployment/phoenix -n observability \
PHOENIX_EVAL_BATCH_SIZE=5
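When the cause is rate limiting on the eval model, retrying immediately tends to fail again. A generic retry wrapper with exponential backoff, sketched below, spaces the attempts out. `retry_with_backoff` is a hypothetical helper, not part of Phoenix.

```shell
# Run a command up to MAX attempts, doubling the delay after each failure.
# Usage: retry_with_backoff MAX INITIAL_DELAY_SECONDS command [args...]
retry_with_backoff() {
  local max="$1" delay="$2" attempt=1
  shift 2
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}
```

A possible use with the retry call from this section: `retry_with_backoff 4 5 curl -fsS -X POST "http://localhost:6006/api/evaluations/${eval_id}/retry"`, which waits 5s, 10s, then 20s between attempts.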
Issue 6: Data Loss After Restart
- Symptoms:
- Traces gone after pod restart
- Historical data missing
- Dashboard empty
- Cause:
- In-memory storage only
- PVC not configured
- Database connection lost
- Resolution:
# Check storage configuration
kubectl get deployment phoenix -n observability -o yaml | grep -A 10 volumes
# Enable persistent storage
kubectl set env deployment/phoenix -n observability \
PHOENIX_WORKING_DIR=/phoenix-data \
PHOENIX_SQL_DATABASE_URL=postgresql://postgres:password@postgresql:5432/phoenix
# Add PVC for local storage
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: phoenix-data
  namespace: observability
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 50Gi
EOF
# Mount PVC to deployment
kubectl patch deployment phoenix -n observability -p '
{
  "spec": {
    "template": {
      "spec": {
        "volumes": [{"name": "data", "persistentVolumeClaim": {"claimName": "phoenix-data"}}],
        "containers": [{"name": "phoenix", "volumeMounts": [{"name": "data", "mountPath": "/phoenix-data"}]}]
      }
    }
  }
}'
Restart Procedure
Graceful Restart (Recommended)
# 1. Check for active evaluations
curl http://localhost:6006/api/evaluations/active
# 2. Wait for evaluations to complete, or pause them
curl -X POST http://localhost:6006/api/evaluations/pause
# 3. Flush in-memory data to disk (if persistent storage is configured)
curl -X POST http://localhost:6006/api/flush
# 4. Perform rolling restart
kubectl rollout restart deployment/phoenix -n observability
# 5. Wait for ready
kubectl wait --for=condition=ready pod -l app=phoenix -n observability --timeout=120s
# 6. Verify health
curl http://localhost:6006/health
# 7. Resume evaluations
curl -X POST http://localhost:6006/api/evaluations/resume
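The steps above can be chained so that a failure stops the sequence before the pod is restarted. This is a sketch only: `run_step`, `graceful_restart`, and the `DRY_RUN` switch are hypothetical names, and the endpoints are copied from the steps above without further verification. Step 1 (checking active evaluations) is informational and is left out.

```shell
# Run one step, or only print it when DRY_RUN=1.
run_step() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Graceful restart: each step runs only if the previous one succeeded.
graceful_restart() {
  local base="http://localhost:6006"
  run_step curl -fsS -X POST "${base}/api/evaluations/pause" &&
  run_step curl -fsS -X POST "${base}/api/flush" &&
  run_step kubectl rollout restart deployment/phoenix -n observability &&
  run_step kubectl wait --for=condition=ready pod -l app=phoenix -n observability --timeout=120s &&
  run_step curl -fsS "${base}/health" &&
  run_step curl -fsS -X POST "${base}/api/evaluations/resume"
}
```

Running `DRY_RUN=1 graceful_restart` first prints the six commands without executing them, which is a cheap way to review the sequence during an incident.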
Emergency Restart
# Force-delete the pod (skips graceful shutdown)
kubectl delete pod -l app=phoenix -n observability --grace-period=0 --force
# Wait for recovery
kubectl wait --for=condition=ready pod -l app=phoenix -n observability --timeout=120s
# Verify UI accessible
curl -I http://localhost:6006
Local Development Restart
# Docker
docker restart phoenix
# OrbStack
orb restart phoenix
# pip installation (use ";" so the server still starts when no process was running)
pkill -f "phoenix"; python -m phoenix.server.main serve
Logs Location
Kubernetes Logs
# Phoenix logs
kubectl logs -f deployment/phoenix -n observability
# Filter for errors
kubectl logs deployment/phoenix -n observability | grep -E "ERROR|WARN|Exception"
# Export logs
kubectl logs deployment/phoenix -n observability > phoenix-logs-$(date +%Y%m%d).txt
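For a quick read on how noisy the logs are, the same patterns used in the filter above can be counted instead of listed. A minimal sketch; `summarize_log_levels` is a hypothetical helper that reads log lines on stdin.

```shell
# Read log lines on stdin and count the patterns this runbook greps for.
# A line containing several patterns is counted once per pattern.
summarize_log_levels() {
  awk '
    /ERROR/     { errors++ }
    /WARN/      { warnings++ }
    /Exception/ { exceptions++ }
    END { printf "errors=%d warnings=%d exceptions=%d\n", errors, warnings, exceptions }
  '
}
```

Usage: `kubectl logs deployment/phoenix -n observability | summarize_log_levels`.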
Application Logs
# Log level adjustment
kubectl set env deployment/phoenix -n observability LOG_LEVEL=DEBUG
# View collector logs specifically
kubectl logs deployment/phoenix -n observability | grep -i collector
Trace Debugging
# Export traces as JSON
curl "http://localhost:6006/api/projects/default/spans?limit=100" > traces.json
# Search for specific trace
curl "http://localhost:6006/api/traces/{trace_id}"
Scaling
Vertical Scaling
# Increase resources for larger trace volumes
kubectl set resources deployment/phoenix -n observability \
--limits=cpu=4000m,memory=16Gi \
--requests=cpu=1000m,memory=4Gi
Storage Scaling
# Expand PVC for more trace storage
kubectl patch pvc phoenix-data -n observability -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'
Read Replica (for high query load)
# Deploy read-only Phoenix instance
kubectl apply -f phoenix-readonly-deployment.yaml
# Configure load balancing
kubectl apply -f phoenix-service-lb.yaml
Scaling Guidelines
| Metric | Threshold | Action |
|---|---|---|
| Memory Usage | > 80% | Increase memory, enable persistence |
| CPU Usage | > 70% | Increase CPU |
| Trace Ingestion Rate | > 1000/sec | Add sampling |
| Disk Usage | > 80% | Expand storage, reduce retention |
| Query Latency P99 | > 5s | Add indexes, reduce time range |
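The table above can be encoded as a lookup so that a script or an on-call helper returns the same action every time. The metric keys (`memory_pct`, `cpu_pct`, and so on) are hypothetical labels for the table rows; thresholds and actions are taken from the table, and values are compared as integers.

```shell
# Map a metric key and integer value to the action from the guidelines table.
# A value must be strictly greater than the threshold to trigger an action.
scaling_action() {
  local metric="$1" value="$2" threshold action
  case "$metric" in
    memory_pct)     threshold=80;   action="Increase memory, enable persistence" ;;
    cpu_pct)        threshold=70;   action="Increase CPU" ;;
    ingest_per_sec) threshold=1000; action="Add sampling" ;;
    disk_pct)       threshold=80;   action="Expand storage, reduce retention" ;;
    query_p99_sec)  threshold=5;    action="Add indexes, reduce time range" ;;
    *) echo "unknown metric: $metric" >&2; return 2 ;;
  esac
  if [ "$value" -gt "$threshold" ]; then
    echo "$action"
  else
    echo "No action"
  fi
}
```

For example, `scaling_action ingest_per_sec 1500` prints `Add sampling`, while `scaling_action cpu_pct 70` prints `No action` because the threshold is exclusive.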
Alerts
| Alert | Condition | Runbook Action |
|---|---|---|
| PhoenixDown | Cannot connect for 2min | Emergency Restart |
| TraceLoss | 0 traces ingested for 10min | Check agent instrumentation |
| StorageFull | Disk > 95% | Reduce retention, expand storage |
Warning Alerts (Slack)
| Alert | Condition | Runbook Action |
|---|---|---|
| HighMemory | Memory > 80% | Enable persistence, increase limit |
| SlowQueries | Query latency > 10s | Add indexes, optimize |
| EvalFailures | >50% eval failures | Check LLM provider |
| HighTraceVolume | >10K traces/min | Enable sampling |
Prometheus Alert Rules
groups:
  - name: phoenix
    rules:
      - alert: PhoenixDown
        expr: up{job="phoenix"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Phoenix is down"
          runbook_url: "https://gitlab.com/blueflyio/agent-platform/technical-docs/-/wikis/runbooks/phoenix"
      - alert: PhoenixHighMemory
        expr: phoenix_memory_usage_bytes / phoenix_memory_limit_bytes > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Phoenix memory usage high"
      - alert: NoTracesIngested
        expr: rate(phoenix_traces_ingested_total[10m]) == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "No traces being ingested"
Monitoring Dashboards
- Phoenix UI: http://localhost:6006
- Grafana: https://grafana.local/d/phoenix
- Prometheus: https://prometheus.local/graph?g0.expr=up{job="phoenix"}
API Reference
Traces
# Get traces
curl "http://localhost:6006/api/projects/default/spans"
# Get specific trace
curl "http://localhost:6006/api/traces/{trace_id}"
# Delete old traces
curl -X DELETE "http://localhost:6006/api/traces?before=2024-01-01"
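Rather than hard-coding the `before` date, the cutoff can be derived from a retention window. A minimal sketch: `retention_cutoff` is a hypothetical helper that tries GNU `date` syntax first and falls back to the BSD/macOS form.

```shell
# Print the UTC date N days ago as YYYY-MM-DD.
# Tries GNU date first, then the BSD/macOS -v form.
retention_cutoff() {
  local days="$1"
  date -u -d "${days} days ago" +%Y-%m-%d 2>/dev/null ||
    date -u -v-"${days}"d +%Y-%m-%d
}
```

With a 7-day window, the delete call above becomes `curl -X DELETE "http://localhost:6006/api/traces?before=$(retention_cutoff 7)"`.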
Evaluations
# List evaluations
curl http://localhost:6006/api/evaluations
# Create evaluation
curl -X POST http://localhost:6006/api/evaluations \
-H "Content-Type: application/json" \
-d '{
"name": "response-quality",
"dataset_id": "...",
"metrics": ["relevance", "fluency"]
}'
# Get evaluation results
curl http://localhost:6006/api/evaluations/{eval_id}/results
Datasets
# List datasets
curl http://localhost:6006/api/datasets
# Upload dataset
curl -X POST http://localhost:6006/api/datasets \
-H "Content-Type: application/json" \
-d '{"name": "test-set", "data": [...]}'
Integration
OpenTelemetry Setup
# Python instrumentation
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor
tracer_provider = register(endpoint="http://localhost:6006/v1/traces")
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
Environment Variables
PHOENIX_COLLECTOR_ENDPOINT=http://phoenix:6006/v1/traces
PHOENIX_PROJECT_NAME=agent-platform
PHOENIX_SQL_DATABASE_URL=postgresql://postgres:password@postgresql:5432/phoenix
Contacts
- On-call: PagerDuty rotation
- Slack: #platform-incidents
- Owner: AI/ML Team