Phoenix Runbook

Overview

Purpose: LLM observability platform for tracing, evaluating, and debugging AI/ML workloads. Provides visibility into LLM calls, token usage, latency, and response quality across the agent platform.
Port: 6006
Health endpoint: GET /health or GET /api/health
Namespace: observability (Kubernetes)
Technology: Arize Phoenix

Dependencies

PostgreSQL (port 5432) - Optional, for persistent trace storage
Agent Brain (port 3001) - Primary trace source
Agent Router (port 3002) - LLM call tracing

Key Features

Feature	Description
Trace Viewer	Visualize LLM call chains and latencies
Token Analytics	Track token usage by model, agent, workflow
Evaluation	Run evals on LLM outputs for quality
Embeddings	Visualize embedding distributions
Datasets	Manage evaluation datasets

Common Issues

Issue 1: Traces Not Appearing

Symptoms:
- Empty trace viewer
- No new traces being recorded
- Agents working but no visibility

Cause:
- OpenTelemetry exporter not configured
- Phoenix collector not receiving data
- Instrumentation library missing

Resolution:

# Check Phoenix is receiving data
curl http://localhost:6006/api/spans/count

# Verify OTEL endpoint configuration in agent-brain
kubectl get deployment agent-brain -n agents -o yaml | grep -A 5 OTEL

# Check if traces are being sent
kubectl logs deployment/agent-brain -n agents | grep -i "opentelemetry\|trace\|span"

# Test direct span submission
curl -X POST http://localhost:6006/v1/traces \
  -H "Content-Type: application/json" \
  -d '{"resourceSpans": []}'

# Verify Phoenix collector is running
kubectl logs deployment/phoenix -n observability | grep -i collector

# Restart agents to re-establish connection
kubectl rollout restart deployment/agent-brain -n agents
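
The grep in the second step matches on raw text, so it can miss variables or pick up unrelated lines. As a rough alternative, the sketch below reads a Deployment manifest as a dict and lists environment variables whose names contain OTEL or PHOENIX. The function name, the name filter, and the sample manifest are illustrative assumptions, not part of kubectl or Phoenix.

```python
def otel_env_vars(deployment, needles=("OTEL", "PHOENIX")):
    """Return tracing-related env vars from a Deployment manifest dict."""
    pod_spec = deployment.get("spec", {}).get("template", {}).get("spec", {})
    found = {}
    for container in pod_spec.get("containers", []):
        for env in container.get("env", []):
            name = env.get("name", "")
            if any(needle in name for needle in needles):
                # Vars sourced from a Secret or ConfigMap carry no literal value.
                found[name] = env.get("value", "<set via valueFrom>")
    return found


# Hypothetical manifest fragment; real input would come from
# `kubectl get deployment agent-brain -n agents -o json`.
sample = {
    "spec": {"template": {"spec": {"containers": [{
        "name": "agent-brain",
        "env": [
            {"name": "LOG_LEVEL", "value": "info"},
            {"name": "OTEL_EXPORTER_OTLP_ENDPOINT",
             "value": "http://phoenix.observability:6006"},
            {"name": "PHOENIX_API_KEY",
             "valueFrom": {"secretKeyRef": {"name": "phoenix", "key": "api-key"}}},
        ],
    }]}}}
}

for name, value in otel_env_vars(sample).items():
    print(f"{name}={value}")
```

An empty result for a real manifest would suggest the exporter was never configured on that deployment.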

Issue 2: High Memory Usage

Symptoms:
- Phoenix consuming excessive memory
- OOMKilled events
- Slow UI response

Cause:
- Too many traces stored in memory
- Large span payloads
- No retention policy configured

Resolution:

# Check memory usage
kubectl top pods -n observability -l app=phoenix

# Check trace count
curl -s "http://localhost:6006/api/projects/default/spans?limit=1" | jq '.total'

# Enable persistence to reduce memory (if not already enabled)
kubectl set env deployment/phoenix -n observability \
  PHOENIX_SQL_DATABASE_URL=postgresql://postgres:password@postgresql:5432/phoenix

# Set retention policy
curl -X PUT http://localhost:6006/api/config \
  -H "Content-Type: application/json" \
  -d '{"retention_days": 7}'

# Force garbage collection
curl -X POST http://localhost:6006/api/gc

# Increase memory limits
kubectl set resources deployment/phoenix -n observability \
  --limits=memory=8Gi --requests=memory=2Gi
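
To decide whether the limit needs raising, it can help to compare the usage reported by kubectl top against the configured limit. The sketch below converts Kubernetes-style memory quantities to bytes and computes utilization. It handles only a subset of suffixes, and the helper names are assumptions made for illustration.

```python
import re

# Binary and decimal suffixes used by Kubernetes memory quantities (subset).
_UNITS = {
    "": 1,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}
_QUANTITY = re.compile(r"(\d+(?:\.\d+)?)(Ki|Mi|Gi|Ti|k|M|G|T)?")


def to_bytes(quantity):
    """Convert a quantity such as '512Mi' or '8Gi' to a byte count."""
    match = _QUANTITY.fullmatch(quantity.strip())
    if match is None:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit or ""])


def memory_utilization(used, limit):
    """Fraction of the memory limit currently in use."""
    return to_bytes(used) / to_bytes(limit)


# Example values only: 7000Mi used against the 8Gi limit set above.
ratio = memory_utilization("7000Mi", "8Gi")
print(f"utilization: {ratio:.0%}, over 80% threshold: {ratio > 0.8}")
```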

Issue 3: Trace Latency / Slow UI

Symptoms:
- Dashboard loading slowly
- Trace queries timing out
- High CPU usage

Cause:
- Large trace volume
- Complex queries
- No indexing on trace data

Resolution:

# Check trace volume
curl http://localhost:6006/api/stats

# Reduce time range in queries
# (Use UI filters to limit data)

# Enable sampling if high volume
kubectl set env deployment/phoenix -n observability \
  PHOENIX_SAMPLE_RATE=0.1

# Optimize database (if using PostgreSQL)
psql -h postgresql -U postgres -d phoenix -c "ANALYZE;"

# Add indexes on common query patterns
psql -h postgresql -U postgres -d phoenix -c "CREATE INDEX IF NOT EXISTS idx_spans_start_time ON spans(start_time);"
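
Sampling can also be applied in the instrumented agents before spans are exported. The sketch below illustrates the idea behind ratio-based sampling: a trace is kept when the low 64 bits of its ID fall below rate * 2**64, so every span of a trace gets the same decision. It is a standard-library illustration written for this runbook, not the OpenTelemetry SDK sampler itself, and the function name is an assumption.

```python
import random

MAX_64 = 2**64


def should_sample(trace_id, rate):
    """Keep a trace when the low 64 bits of its ID fall below rate * 2**64."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    return (trace_id & (MAX_64 - 1)) < int(rate * MAX_64)


# Simulate 10,000 random 128-bit trace IDs at a 10% sample rate.
rng = random.Random(0)
trace_ids = [rng.getrandbits(128) for _ in range(10_000)]
kept = sum(should_sample(trace_id, 0.1) for trace_id in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

Because the decision depends only on the trace ID, different services reach the same keep or drop decision for a given trace without coordination.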

Issue 4: Embedding Visualization Broken

Symptoms:
- Embedding projections not loading
- "UMAP failed" errors
- Blank embedding explorer

Cause:
- Not enough embeddings for projection
- Memory insufficient for UMAP
- Embedding dimension mismatch

Resolution:

# Check embedding count
curl http://localhost:6006/api/embeddings/count

# Verify embedding dimensions are consistent
curl http://localhost:6006/api/embeddings/dimensions

# Force re-compute projections
curl -X POST http://localhost:6006/api/embeddings/reproject

# Reduce sample size for projections
kubectl set env deployment/phoenix -n observability \
  PHOENIX_EMBEDDING_SAMPLE_SIZE=1000
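
A dimension mismatch can also be checked on the client side before embeddings are logged. The sketch below treats the most common vector length as the expected dimension and reports the positions of vectors that differ. The function name and the sample vectors are hypothetical.

```python
from collections import Counter


def dimension_report(embeddings):
    """Return (expected_dimension, indexes of vectors with another length)."""
    counts = Counter(len(vector) for vector in embeddings)
    if not counts:
        return None, []
    expected = counts.most_common(1)[0][0]
    mismatched = [i for i, vector in enumerate(embeddings) if len(vector) != expected]
    return expected, mismatched


# Hypothetical vectors: the third one has 3 components instead of 4.
vectors = [[0.1] * 4, [0.2] * 4, [0.3] * 3, [0.4] * 4]
expected, mismatched = dimension_report(vectors)
print(f"expected dimension: {expected}, mismatched indexes: {mismatched}")
```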

Issue 5: Evaluations Failing

Symptoms:
- Eval runs not completing
- "Evaluation error" in logs
- Metrics not being computed

Cause:
- LLM-as-judge API failing
- Eval dataset malformed
- Rate limiting on eval model

Resolution:

# Check eval job status
curl http://localhost:6006/api/evaluations/status

# View eval errors
curl "http://localhost:6006/api/evaluations/{eval_id}/errors"

# Retry failed evaluation
curl -X POST "http://localhost:6006/api/evaluations/{eval_id}/retry"

# Check LLM availability for evals
curl http://localhost:3002/api/v1/providers/health

# Reduce eval batch size
kubectl set env deployment/phoenix -n observability \
  PHOENIX_EVAL_BATCH_SIZE=5
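
When the eval model is rate limited, smaller batches combined with retries and exponential backoff usually help. The sketch below shows both pieces in plain Python. The helper names are assumptions, and the broad exception handler is for illustration only; real code would catch the specific rate-limit error of the provider client.

```python
import time


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def backoff_delays(retries, base=1.0, cap=30.0):
    """Exponential delays in seconds: base, 2*base, 4*base, ... capped at `cap`."""
    return [min(cap, base * 2**attempt) for attempt in range(retries)]


def call_with_retry(call, delays, sleep=time.sleep):
    """Run `call`, sleeping between failed attempts; the last error propagates."""
    for delay in delays:
        try:
            return call()
        except Exception:  # illustration only; narrow this in real code
            sleep(delay)
    return call()


print(list(batches(list(range(12)), 5)))
print(backoff_delays(6))
```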

Issue 6: Data Loss After Restart

Symptoms:
- Traces gone after pod restart
- Historical data missing
- Dashboard empty

Cause:
- In-memory storage only
- PVC not configured
- Database connection lost

Resolution:

# Check storage configuration
kubectl get deployment phoenix -n observability -o yaml | grep -A 10 volumes

# Enable persistent storage
kubectl set env deployment/phoenix -n observability \
  PHOENIX_WORKING_DIR=/phoenix-data \
  PHOENIX_SQL_DATABASE_URL=postgresql://postgres:password@postgresql:5432/phoenix

# Add PVC for local storage
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: phoenix-data
  namespace: observability
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 50Gi
EOF

# Mount PVC to deployment
kubectl patch deployment phoenix -n observability -p '
{
  "spec": {
    "template": {
      "spec": {
        "volumes": [{"name": "data", "persistentVolumeClaim": {"claimName": "phoenix-data"}}],
        "containers": [{"name": "phoenix", "volumeMounts": [{"name": "data", "mountPath": "/phoenix-data"}]}]
      }
    }
  }
}'
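
Hand-written JSON inside shell quotes is easy to break. As an alternative, the patch body can be generated, as in the sketch below. The function name and its defaults are assumptions that mirror the names used in the patch above.

```python
import json


def pvc_mount_patch(claim, mount_path, container="phoenix", volume="data"):
    """Build a patch body that mounts a PVC into one container."""
    patch = {
        "spec": {"template": {"spec": {
            "volumes": [
                {"name": volume, "persistentVolumeClaim": {"claimName": claim}},
            ],
            "containers": [
                {"name": container,
                 "volumeMounts": [{"name": volume, "mountPath": mount_path}]},
            ],
        }}}
    }
    return json.dumps(patch)


print(pvc_mount_patch("phoenix-data", "/phoenix-data"))
```

The printed JSON can be passed to kubectl patch through the -p flag in place of the inline literal.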

Restart Procedure

Graceful Restart (Recommended)

# 1. Check for active evaluations
curl http://localhost:6006/api/evaluations/active

# 2. Wait for evaluations to complete, or pause them
curl -X POST http://localhost:6006/api/evaluations/pause

# 3. Flush in-memory data to disk (if persistent storage is configured)
curl -X POST http://localhost:6006/api/flush

# 4. Perform rolling restart
kubectl rollout restart deployment/phoenix -n observability

# 5. Wait for the rollout to finish (old pods may still report ready right after the restart)
kubectl rollout status deployment/phoenix -n observability --timeout=120s

# 6. Verify health
curl http://localhost:6006/health

# 7. Resume evaluations
curl -X POST http://localhost:6006/api/evaluations/resume
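
The health check in step 6 can fail for a few seconds while the new pod starts, so a polling loop is more reliable than a single request. A minimal sketch using only the standard library follows. The helper names are assumptions, and the health URL is the one used in this runbook.

```python
import http.client
import time
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True when the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        return False


def wait_until(check, timeout=120.0, interval=2.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll `check` until it returns True or `timeout` seconds have passed."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Usage (not executed here because it needs a running Phoenix instance):
#   healthy = wait_until(lambda: http_ok("http://localhost:6006/health"))
```

The sleep and clock parameters are injectable so the loop can be tested without waiting in real time.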

Emergency Restart

# Force restart
kubectl delete pod -l app=phoenix -n observability --force

# Wait for recovery
kubectl wait --for=condition=ready pod -l app=phoenix -n observability --timeout=120s

# Verify UI accessible
curl -I http://localhost:6006

Local Development Restart

# Docker
docker restart phoenix

# OrbStack
orb restart phoenix

# pip installation (";" instead of "&&" so the server starts even if no process was running)
pkill -f "phoenix"; python -m phoenix.server.main serve

Logs Location

Kubernetes Logs

# Phoenix logs
kubectl logs -f deployment/phoenix -n observability

# Filter for errors
kubectl logs deployment/phoenix -n observability | grep -E "ERROR|WARN|Exception"

# Export logs
kubectl logs deployment/phoenix -n observability > phoenix-logs-$(date +%Y%m%d).txt
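
For an exported log file, a per-category count is often more useful than the raw grep output. The sketch below applies the same pattern as the grep filter above and tallies the matches. The sample lines are made up for illustration and do not reflect actual Phoenix log output.

```python
import re
from collections import Counter

# Same alternation as the grep filter above; WARN also matches WARNING.
PATTERN = re.compile(r"ERROR|WARN|Exception")


def summarize(lines):
    """Count log lines by the first matching keyword on each line."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group(0)] += 1
    return counts


# Made-up sample lines; real input would be the exported log file.
sample_lines = [
    "INFO server started",
    "ERROR database connection refused",
    "WARNING slow query detected",
    "ValueError raised: Exception while parsing span",
    "ERROR database connection refused",
]
for keyword, count in summarize(sample_lines).most_common():
    print(f"{keyword}: {count}")
```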

Application Logs

# Log level adjustment
kubectl set env deployment/phoenix -n observability LOG_LEVEL=DEBUG

# View collector logs specifically
kubectl logs deployment/phoenix -n observability | grep -i collector

Trace Debugging

# Export traces as JSON
curl "http://localhost:6006/api/projects/default/spans?limit=100" > traces.json

# Search for specific trace
curl "http://localhost:6006/api/traces/{trace_id}"

Scaling

Vertical Scaling

# Increase resources for larger trace volumes
kubectl set resources deployment/phoenix -n observability \
  --limits=cpu=4000m,memory=16Gi \
  --requests=cpu=1000m,memory=4Gi

Storage Scaling

# Expand PVC for more trace storage
kubectl patch pvc phoenix-data -n observability -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'

Read Replica (for high query load)

# Deploy read-only Phoenix instance
kubectl apply -f phoenix-readonly-deployment.yaml

# Configure load balancing
kubectl apply -f phoenix-service-lb.yaml

Scaling Guidelines

Metric	Threshold	Action
Memory Usage	> 80%	Increase memory, enable persistence
CPU Usage	> 70%	Increase CPU
Trace Ingestion Rate	> 1000/sec	Add sampling
Disk Usage	> 80%	Expand storage, reduce retention
Query Latency P99	> 5s	Add indexes, reduce time range
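
The guidelines above can be encoded as data so that a script or dashboard hook returns the matching actions for a set of readings. This is a sketch only: the thresholds are copied from the table, while the metric key names are assumptions chosen for the example.

```python
# (metric key, threshold, action) rows copied from the Scaling Guidelines table.
GUIDELINES = [
    ("memory_pct", 80, "Increase memory, enable persistence"),
    ("cpu_pct", 70, "Increase CPU"),
    ("traces_per_sec", 1000, "Add sampling"),
    ("disk_pct", 80, "Expand storage, reduce retention"),
    ("query_p99_s", 5, "Add indexes, reduce time range"),
]


def recommended_actions(metrics):
    """Return the actions whose threshold is strictly exceeded."""
    return [
        action
        for key, threshold, action in GUIDELINES
        if metrics.get(key, 0) > threshold
    ]


# Example readings (hypothetical values).
readings = {"memory_pct": 85, "cpu_pct": 40, "query_p99_s": 7.5}
for action in recommended_actions(readings):
    print(action)
```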

Alerts

Critical Alerts (PagerDuty)

Alert	Condition	Runbook Action
PhoenixDown	Cannot connect for 2min	Emergency Restart
TraceLoss	0 traces ingested for 10min	Check agent instrumentation
StorageFull	Disk > 95%	Reduce retention, expand storage

Warning Alerts (Slack)

Alert	Condition	Runbook Action
HighMemory	Memory > 80%	Enable persistence, increase limit
SlowQueries	Query latency > 10s	Add indexes, optimize
EvalFailures	>50% eval failures	Check LLM provider
HighTraceVolume	>10K traces/min	Enable sampling

Prometheus Alert Rules

groups:
  - name: phoenix
    rules:
      - alert: PhoenixDown
        expr: up{job="phoenix"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Phoenix is down"
          runbook_url: "https://gitlab.com/blueflyio/agent-platform/technical-docs/-/wikis/runbooks/phoenix"

      - alert: PhoenixHighMemory
        expr: phoenix_memory_usage_bytes / phoenix_memory_limit_bytes > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Phoenix memory usage high"

      - alert: NoTracesIngested
        expr: rate(phoenix_traces_ingested_total[10m]) == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "No traces being ingested"

Monitoring Dashboards

Phoenix UI: http://localhost:6006
Grafana: https://grafana.local/d/phoenix
Prometheus: https://prometheus.local/graph?g0.expr=up{job="phoenix"}

API Reference

Traces

# Get traces
curl "http://localhost:6006/api/projects/default/spans"

# Get specific trace
curl "http://localhost:6006/api/traces/{trace_id}"

# Delete old traces
curl -X DELETE "http://localhost:6006/api/traces?before=2024-01-01"
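
The cutoff date in the delete call is usually derived from a retention period rather than typed by hand. The sketch below computes it and builds the request URL. The endpoint path and the before parameter are taken from this runbook as written and have not been verified against the Phoenix API; the function name is an assumption.

```python
from datetime import date, timedelta
from urllib.parse import urlencode


def delete_url(base, retention_days, today=None):
    """Build the trace-deletion URL for everything older than the retention window."""
    today = today or date.today()
    cutoff = today - timedelta(days=retention_days)
    query = urlencode({"before": cutoff.isoformat()})
    return f"{base}/api/traces?{query}"


# With a 7-day retention evaluated on 2024-01-08, the cutoff is 2024-01-01.
print(delete_url("http://localhost:6006", 7, today=date(2024, 1, 8)))
```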

Evaluations

# List evaluations
curl http://localhost:6006/api/evaluations

# Create evaluation
curl -X POST http://localhost:6006/api/evaluations \
  -H "Content-Type: application/json" \
  -d '{
    "name": "response-quality",
    "dataset_id": "...",
    "metrics": ["relevance", "fluency"]
  }'

# Get evaluation results
curl "http://localhost:6006/api/evaluations/{eval_id}/results"

Datasets

# List datasets
curl http://localhost:6006/api/datasets

# Upload dataset
curl -X POST http://localhost:6006/api/datasets \
  -H "Content-Type: application/json" \
  -d '{"name": "test-set", "data": [...]}'

Integration

OpenTelemetry Setup

# Python instrumentation
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

tracer_provider = register(endpoint="http://localhost:6006/v1/traces")
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

Environment Variables

PHOENIX_COLLECTOR_ENDPOINT=http://phoenix:6006/v1/traces
PHOENIX_PROJECT_NAME=agent-platform
PHOENIX_SQL_DATABASE_URL=postgresql://postgres:password@postgresql:5432/phoenix
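
A malformed collector endpoint is a common reason for missing traces, so it can be worth validating these variables at agent startup. The sketch below reads them from a mapping, checks the URL, and appends /v1/traces when the path is missing. The function name and the fallback values are assumptions made for this example.

```python
import os
from urllib.parse import urlsplit, urlunsplit


def resolve_tracing_config(env=os.environ):
    """Return a validated endpoint and project name from the environment."""
    # Fallback values below are assumptions for local development.
    endpoint = env.get("PHOENIX_COLLECTOR_ENDPOINT", "http://localhost:6006/v1/traces")
    parts = urlsplit(endpoint)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"invalid collector endpoint: {endpoint!r}")
    path = parts.path.rstrip("/")
    if not path.endswith("/v1/traces"):
        path += "/v1/traces"
    endpoint = urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
    return {
        "endpoint": endpoint,
        "project_name": env.get("PHOENIX_PROJECT_NAME", "default"),
    }


print(resolve_tracing_config({"PHOENIX_COLLECTOR_ENDPOINT": "http://phoenix:6006"}))
```

The returned dict has the keys endpoint and project_name, which match keyword arguments accepted by phoenix.otel.register, so it could be passed as register(**resolve_tracing_config()).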

Contacts

On-call: PagerDuty rotation
Slack: #platform-incidents
Owner: AI/ML Team

Related Runbooks

Agent Brain Runbook - Trace source
Agent Router Runbook - LLM call tracing
PostgreSQL Runbook - Trace storage
