README
Production Service Runbooks
Separation of Duties: See Separation of Duties. Runbooks document operational procedures; they do NOT own agent manifests, execution, or infrastructure configuration.
Vast.ai Integration: See BULLETPROOF_VASTAI_PLAN.md for the complete Vast.ai implementation plan.
This directory contains operational runbooks for all production services in the agent platform.
Agent Services
| Service | Port | Runbook | Description |
|---|---|---|---|
| Agent BuildKit | N/A (CLI) | agent-buildkit.md | Agent lifecycle management and CLI toolkit |
| Agent Mesh | 3005 | agent-mesh.md | Distributed coordination and gRPC server |
| Agent Brain | 3006 | agent-brain.md | Knowledge graph and vector database |
| Agent Router | 3007 | agent-router.md | LLM routing and load balancing |
| Agent Tracer | 3002 | agent-tracer.md | Observability and tracing (ACE/ATLAS) |
| Agent Studio | 3008 | agent-studio.md | IDE suite and development environment |
| Workflow Engine | 3004 | workflow-engine.md | Workflow orchestration and Langflow bridge |
CI/CD Components
| Service | Port | Runbook | Description |
|---|---|---|---|
| GitLab Components | N/A | gitlab-components.md | CI/CD component library (60+ components) |
Data Services
| Service | Port | Runbook | Description |
|---|---|---|---|
| PostgreSQL | 5432 | postgresql.md | Relational data storage |
| Redis | 6379 | redis.md | Cache, sessions, pub/sub |
| Qdrant | 6333 | qdrant.md | Vector search and embeddings |
Observability Services
| Service | Port | Runbook | Description |
|---|---|---|---|
| Phoenix | 6006 | phoenix.md | LLM observability and tracing |
Runbook Structure
Each runbook follows a consistent structure:
- Overview - Purpose, port, health endpoint, dependencies
- Common Issues - Symptoms, causes, resolution steps
- Restart Procedure - Graceful and emergency restart steps
- Logs Location - Where to find logs (Kubernetes and local)
- Scaling - Horizontal and vertical scaling guidelines
- Alerts - Critical and warning alert definitions
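
The structure above can be checked mechanically. The following is a minimal sketch, assuming each section appears in the runbook as a markdown heading (`#` through `######`) whose text matches the section name; that heading convention and the function name `missing_sections` are assumptions for illustration, not something this README specifies.

```python
import re

# Section names taken from the runbook structure list in this README.
REQUIRED_SECTIONS = [
    "Overview",
    "Common Issues",
    "Restart Procedure",
    "Logs Location",
    "Scaling",
    "Alerts",
]

# Matches an ATX-style markdown heading and captures its text.
HEADING_RE = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", flags=re.MULTILINE)


def missing_sections(markdown_text):
    """Return the required sections that have no matching heading, in template order."""
    headings = {m.group(1).lower() for m in HEADING_RE.finditer(markdown_text)}
    return [name for name in REQUIRED_SECTIONS if name.lower() not in headings]
```

Running `missing_sections` over the text of each runbook file would list the sections that still need to be written; an empty list means the runbook matches the template.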
Quick Reference
Health Checks

    # Agent services
    curl http://localhost:3005/health   # agent-mesh
    curl http://localhost:3006/health   # agent-brain
    curl http://localhost:3007/health   # agent-router
    curl http://localhost:3004/health   # workflow-engine
    curl http://localhost:3002/health   # agent-tracer
    curl http://localhost:3008/health   # agent-studio

    # Data services
    redis-cli -p 6379 ping
    psql -h localhost -U postgres -c "SELECT 1"
    curl http://localhost:6333/health   # qdrant

    # Observability
    curl http://localhost:6006/health   # phoenix

    # CLI tools
    buildkit status                     # agent-buildkit

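
For checking every HTTP health endpoint in one pass, the commands above could be wrapped in a small script. This is a sketch only: the ports and the `/health` path are taken from this README, while the names `HTTP_SERVICES`, `health_url`, `is_healthy`, and `check_all` are hypothetical. Redis and PostgreSQL are omitted because they are checked with their own clients rather than over HTTP.

```python
from http.client import HTTPException
from urllib.error import URLError
from urllib.request import urlopen

# Ports as listed in the service tables of this README.
HTTP_SERVICES = {
    "agent-tracer": 3002,
    "workflow-engine": 3004,
    "agent-mesh": 3005,
    "agent-brain": 3006,
    "agent-router": 3007,
    "agent-studio": 3008,
    "phoenix": 6006,
    "qdrant": 6333,
}


def health_url(service, host="localhost"):
    """Build the health URL for a service, assuming the /health path used above."""
    return f"http://{host}:{HTTP_SERVICES[service]}/health"


def is_healthy(url, timeout=2.0):
    """Return True only if the endpoint answers with a 2xx status within the timeout."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (URLError, HTTPException, OSError):
        # Connection refused, timeout, malformed response, or a 4xx/5xx status.
        return False


def check_all(probe=is_healthy, host="localhost"):
    """Probe every HTTP service and return a {service: bool} mapping."""
    return {name: probe(health_url(name, host)) for name in HTTP_SERVICES}
```

Calling `check_all()` returns a mapping such as `{"agent-mesh": True, ...}` that can be printed or fed into an alerting hook. The `probe` parameter exists so the check can be replaced in tests without touching the network.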
Port Map
| Port(s) | Service |
|---|---|
| 3002-3009 | Agent Services |
| 5432 | PostgreSQL |
| 6006 | Phoenix (LLM observability) |
| 6333 | Qdrant Vector DB |
| 6379 | Redis |
| 7687 | Neo4j |
| 9090 | Prometheus |
| 14268 | Jaeger (collector) |
| 16686 | Jaeger UI |
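
When a service looks down, a quick first step is to confirm that something is listening on its port at all. The sketch below does a plain TCP connect against the single-port entries in the table above; it says nothing about whether the service behind the port is healthy. The names `PORT_MAP`, `port_open`, and `closed_ports` are hypothetical.

```python
import socket

# Single-port entries from the port map in this README.
PORT_MAP = {
    "PostgreSQL": 5432,
    "Phoenix": 6006,
    "Qdrant": 6333,
    "Redis": 6379,
    "Neo4j": 7687,
    "Prometheus": 9090,
    "Jaeger": 14268,
    "Jaeger UI": 16686,
}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


def closed_ports(host="localhost", ports=PORT_MAP, probe=port_open):
    """Return the names of services whose port does not accept connections, sorted."""
    return sorted(name for name, port in ports.items() if not probe(host, port))
```

`closed_ports()` returns the names of services that did not accept a connection on `localhost`; pass a different `host` to check a remote node.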
Emergency Contacts
- On-call: PagerDuty rotation
- Slack: #platform-incidents
- Escalation: See individual runbooks
Service Dependencies

    agent-buildkit
      -> agent-mesh (coordination)
      -> agent-tracer (observability)
      -> GitLab API (CI/CD)

    agent-mesh
      -> Redis (pub/sub)
      -> PostgreSQL (registry)
      -> Qdrant (capability matching)

    agent-brain
      -> agent-router (LLM APIs)
      -> Qdrant (semantic memory)
      -> Redis (caching)
      -> Phoenix (tracing)

    agent-router
      -> Redis (rate limiting)
      -> Phoenix (tracing)
      -> LLM APIs (Anthropic, OpenAI, Ollama)

    agent-tracer
      -> Phoenix (LLM traces)
      -> Jaeger (distributed traces)
      -> Neo4j (correlation)
      -> Qdrant (embeddings)

    agent-studio
      -> agent-mesh (backend)
      -> agent-brain (knowledge)
      -> agent-tracer (observability)
      -> Ollama (local models)

    workflow-engine
      -> Redis (task queue)
      -> PostgreSQL (workflow storage)
      -> agent-mesh (coordination)
      -> Langflow (visual builder)

    gitlab-components
      -> GitLab CI/CD
      -> semantic-release

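
During an incident it helps to know which services sit downstream of a failed dependency, and in what order to bring things back. The sketch below encodes the dependency list above as a plain dictionary and derives both answers from it. The function names are hypothetical, and treating "LLM APIs" and "Ollama" as separate nodes is a simplifying assumption.

```python
from graphlib import TopologicalSorter

# Direct dependencies, transcribed from the dependency list in this README.
DEPENDS_ON = {
    "agent-buildkit": ["agent-mesh", "agent-tracer", "GitLab API"],
    "agent-mesh": ["Redis", "PostgreSQL", "Qdrant"],
    "agent-brain": ["agent-router", "Qdrant", "Redis", "Phoenix"],
    "agent-router": ["Redis", "Phoenix", "LLM APIs"],
    "agent-tracer": ["Phoenix", "Jaeger", "Neo4j", "Qdrant"],
    "agent-studio": ["agent-mesh", "agent-brain", "agent-tracer", "Ollama"],
    "workflow-engine": ["Redis", "PostgreSQL", "agent-mesh", "Langflow"],
    "gitlab-components": ["GitLab CI/CD", "semantic-release"],
}


def impacted_by(failed):
    """Return every service that depends on `failed`, directly or transitively."""
    impacted = set()
    frontier = [failed]
    while frontier:
        current = frontier.pop()
        for service, deps in DEPENDS_ON.items():
            if current in deps and service not in impacted:
                impacted.add(service)
                frontier.append(service)
    return impacted


def startup_order():
    """Return an order in which every dependency precedes the services that need it."""
    return list(TopologicalSorter(DEPENDS_ON).static_order())
```

For example, `impacted_by("Redis")` includes `agent-studio` even though it has no direct Redis dependency, because it relies on `agent-mesh` and `agent-brain`. `startup_order()` gives one valid restart sequence; other valid orderings exist.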
Contributing
When updating runbooks:
- Follow the existing template structure
- Include specific commands that can be copy-pasted
- Document both graceful and emergency procedures
- Keep alert thresholds current with monitoring configuration
- Test all commands before documenting
- Update this README when adding new runbooks
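
The last guideline is easy to forget, so it can be worth checking automatically. The following sketch, assuming runbooks are `.md` files in this directory and are listed in the third column of the service tables, reports runbook files that the README does not mention. The function names are hypothetical, and any non-runbook markdown file in the directory would need to be excluded by hand.

```python
import re

# Matches a markdown filename, whether plain or inside a link.
RUNBOOK_RE = re.compile(r"[\w-]+\.md")


def runbooks_listed(readme_text):
    """Collect runbook filenames from the Runbook column of four-column table rows."""
    listed = set()
    for line in readme_text.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != 4:
            continue
        match = RUNBOOK_RE.search(cells[2])
        if match:
            listed.add(match.group(0))
    return listed


def unlisted_runbooks(readme_text, filenames):
    """Return markdown files (other than README.md) that no table row references."""
    listed = runbooks_listed(readme_text)
    return sorted(
        name
        for name in filenames
        if name.endswith(".md") and name != "README.md" and name not in listed
    )
```

Passing the README text and a directory listing (for example from `os.listdir`) to `unlisted_runbooks` yields the files that still need a table entry.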