README

Production Service Runbooks

Separation of Duties: See Separation of Duties. Runbooks document operational procedures; they do NOT own agent manifests, execution, or infrastructure configuration.

Vast.ai Integration: See BULLETPROOF_VASTAI_PLAN.md for the complete Vast.ai implementation plan.

This directory contains operational runbooks for all production services in the agent platform.

Agent Services

Service	Port	Runbook	Description
Agent BuildKit	N/A (CLI)	agent-buildkit.md	Agent lifecycle management and CLI toolkit
Agent Mesh	3005	agent-mesh.md	Distributed coordination and gRPC server
Agent Brain	3006	agent-brain.md	Knowledge graph and vector database
Agent Router	3007	agent-router.md	LLM routing and load balancing
Agent Tracer	3002	agent-tracer.md	Observability and tracing (ACE/ATLAS)
Agent Studio	3008	agent-studio.md	IDE suite and development environment
Workflow Engine	3004	workflow-engine.md	Workflow orchestration and Langflow bridge

CI/CD Components

Service	Port	Runbook	Description
GitLab Components	N/A	gitlab-components.md	CI/CD component library (60+ components)

Data Services

Service	Port	Runbook	Description
PostgreSQL	5432	postgresql.md	Relational data storage
Redis	6379	redis.md	Cache, sessions, pub/sub
Qdrant	6333	qdrant.md	Vector search and embeddings

Observability Services

Service	Port	Runbook	Description
Phoenix	6006	phoenix.md	LLM observability and tracing

Runbook Structure

Each runbook follows a consistent structure:

Overview - Purpose, port, health endpoint, dependencies
Common Issues - Symptoms, causes, resolution steps
Restart Procedure - Graceful and emergency restart steps
Logs Location - Where to find logs (Kubernetes and local)
Scaling - Horizontal and vertical scaling guidelines
Alerts - Critical and warning alert definitions
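
The structure above can be checked mechanically. The following is a minimal sketch, assuming each section title appears verbatim somewhere in the runbook file; `check_runbook` is a hypothetical helper, not an existing tool in this repository.

```shell
# check_runbook FILE
# Prints one line per required section that FILE does not mention.
# Returns 0 when all six sections are present, 1 otherwise.
# Assumption: section titles appear verbatim in the runbook text.
check_runbook() {
    file=$1
    missing=0
    for section in "Overview" "Common Issues" "Restart Procedure" \
                   "Logs Location" "Scaling" "Alerts"; do
        # -F: match the title as a fixed string, not a regular expression.
        if ! grep -q -F -- "$section" "$file"; then
            printf 'missing section: %s (%s)\n' "$section" "$file"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}
```

For example, `check_runbook redis.md` could run as a pre-merge check when a runbook is added or edited.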

Quick Reference

Health Checks

# Agent services
curl http://localhost:3005/health  # agent-mesh
curl http://localhost:3006/health  # agent-brain
curl http://localhost:3007/health  # agent-router
curl http://localhost:3004/health  # workflow-engine
curl http://localhost:3002/health  # agent-tracer
curl http://localhost:3008/health  # agent-studio

# Data services
redis-cli -p 6379 ping
psql -h localhost -U postgres -c "SELECT 1"
curl http://localhost:6333/healthz  # qdrant (exposes /healthz, /livez, /readyz)

# Observability
curl http://localhost:6006/health  # phoenix

# CLI tools
buildkit status                    # agent-buildkit
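
During an incident it can help to run the agent service checks in one pass. The following is a minimal sketch that polls the six `/health` endpoints listed above; the function names `check_http` and `check_all` are hypothetical, and the 3-second timeout is an assumed value, not a platform setting.

```shell
# check_http NAME URL
# Prints "OK NAME" or "FAIL NAME (URL)"; returns non-zero on failure.
check_http() {
    # --fail turns HTTP error statuses into a non-zero exit code;
    # --max-time bounds the wait on a hung service (3s is an assumption).
    if curl --silent --fail --max-time 3 --output /dev/null "$2" </dev/null; then
        printf 'OK %s\n' "$1"
    else
        printf 'FAIL %s (%s)\n' "$1" "$2"
        return 1
    fi
}

# check_all
# Checks every agent service from the table above; returns 0 only if all pass.
check_all() {
    failed=0
    while read -r name port; do
        check_http "$name" "http://localhost:${port}/health" || failed=$((failed + 1))
    done <<EOF
agent-tracer 3002
workflow-engine 3004
agent-mesh 3005
agent-brain 3006
agent-router 3007
agent-studio 3008
EOF
    [ "$failed" -eq 0 ]
}
```

Running `check_all` prints one line per service, so the exit status can gate a follow-up step such as paging or a restart. The data services are left out because they use their own client commands.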

Port Map

Port / Range	Service Category
3002-3009	Agent Services
5432	PostgreSQL
6006	Phoenix (LLM observability)
6333	Qdrant Vector DB
6379	Redis
7687	Neo4j (Bolt)
9090	Prometheus
14268	Jaeger collector (HTTP)
16686	Jaeger UI
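
When a health endpoint is unreachable, a plain TCP probe shows whether anything is listening on the port at all. The following is a minimal sketch, assuming a netcat build that supports `-z` and `-w` (OpenBSD netcat and recent Ncat do); `probe_port` and `probe_infra_ports` are hypothetical helper names.

```shell
# probe_port NAME PORT
# Reports whether something accepts TCP connections on localhost:PORT.
probe_port() {
    # -z: connect without sending data; -w 2: give up after two seconds.
    if nc -z -w 2 localhost "$2" </dev/null >/dev/null 2>&1; then
        printf '%s open (%s)\n' "$1" "$2"
    else
        printf '%s closed (%s)\n' "$1" "$2"
        return 1
    fi
}

# probe_infra_ports
# Probes the single-port entries from the port map; returns 0 if all are open.
probe_infra_ports() {
    closed=0
    while read -r port name; do
        probe_port "$name" "$port" || closed=$((closed + 1))
    done <<EOF
5432 postgresql
6006 phoenix
6333 qdrant
6379 redis
7687 neo4j
9090 prometheus
14268 jaeger-collector
16686 jaeger-ui
EOF
    [ "$closed" -eq 0 ]
}
```

An open port only shows that a listener exists; it does not replace the service-level health checks.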

Emergency Contacts

On-call: PagerDuty rotation
Slack: #platform-incidents
Escalation: See individual runbooks

Service Dependencies

agent-buildkit
  -> agent-mesh (coordination)
  -> agent-tracer (observability)
  -> GitLab API (CI/CD)

agent-mesh
  -> Redis (pub/sub)
  -> PostgreSQL (registry)
  -> Qdrant (capability matching)

agent-brain
  -> agent-router (LLM APIs)
  -> Qdrant (semantic memory)
  -> Redis (caching)
  -> Phoenix (tracing)

agent-router
  -> Redis (rate limiting)
  -> Phoenix (tracing)
  -> LLM APIs (Anthropic, OpenAI, Ollama)

agent-tracer
  -> Phoenix (LLM traces)
  -> Jaeger (distributed traces)
  -> Neo4j (correlation)
  -> Qdrant (embeddings)

agent-studio
  -> agent-mesh (backend)
  -> agent-brain (knowledge)
  -> agent-tracer (observability)
  -> Ollama (local models)

workflow-engine
  -> Redis (task queue)
  -> PostgreSQL (workflow storage)
  -> agent-mesh (coordination)
  -> Langflow (visual builder)

gitlab-components
  -> GitLab CI/CD
  -> semantic-release
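
The dependency list above reads top-down, but during an outage the reverse question is usually more pressing: which services are affected when a dependency fails? The following is a minimal sketch that encodes the list as a lookup. The lowercase identifiers (for example `gitlab-api`, `llm-apis`, `gitlab-ci`) are labels invented for this example, and the table must be kept in sync with this README by hand.

```shell
SERVICES="agent-buildkit agent-mesh agent-brain agent-router agent-tracer agent-studio workflow-engine gitlab-components"

# deps_of SERVICE
# Prints the direct dependencies of SERVICE; returns 1 for an unknown service.
deps_of() {
    case "$1" in
        agent-buildkit)    echo "agent-mesh agent-tracer gitlab-api" ;;
        agent-mesh)        echo "redis postgresql qdrant" ;;
        agent-brain)       echo "agent-router qdrant redis phoenix" ;;
        agent-router)      echo "redis phoenix llm-apis" ;;
        agent-tracer)      echo "phoenix jaeger neo4j qdrant" ;;
        agent-studio)      echo "agent-mesh agent-brain agent-tracer ollama" ;;
        workflow-engine)   echo "redis postgresql agent-mesh langflow" ;;
        gitlab-components) echo "gitlab-ci semantic-release" ;;
        *) return 1 ;;
    esac
}

# dependents_of DEPENDENCY
# Prints every service that lists DEPENDENCY as a direct dependency.
dependents_of() {
    target=$1
    for svc in $SERVICES; do
        for dep in $(deps_of "$svc"); do
            if [ "$dep" = "$target" ]; then
                echo "$svc"
                break
            fi
        done
    done
}
```

For example, `dependents_of redis` prints agent-mesh, agent-brain, agent-router, and workflow-engine, which are the runbooks to open first when Redis is degraded. Only direct dependencies are covered; transitive impact (agent-studio through agent-mesh, for instance) would need a recursive walk.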

Contributing

When updating runbooks:

Follow the existing template structure
Include specific commands that can be copy-pasted
Document both graceful and emergency procedures
Keep alert thresholds current with monitoring configuration
Test all commands before documenting
Update this README when adding new runbooks
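
The last item is easy to forget, so it may be worth automating. The following is a minimal sketch that flags runbook files not mentioned in this README, assuming runbooks are `.md` files in a single directory and the README is itself a file in that directory; `unlisted_runbooks` is a hypothetical helper.

```shell
# unlisted_runbooks DIR README
# Prints every .md file in DIR whose name does not appear in README.
# Returns 0 when all files are listed, 1 otherwise.
unlisted_runbooks() {
    dir=$1
    readme=$2
    status=0
    for path in "$dir"/*.md; do
        [ -e "$path" ] || continue            # the glob matched nothing
        base=$(basename "$path")
        # Skip the README itself.
        [ "$base" = "$(basename "$readme")" ] && continue
        # -F: match the file name as a fixed string.
        if ! grep -q -F -- "$base" "$readme"; then
            printf 'not listed in README: %s\n' "$base"
            status=1
        fi
    done
    return "$status"
}
```

A call such as `unlisted_runbooks . README.md` could run in CI so that a new runbook without a table entry fails the pipeline.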