README

Production Service Runbooks

Separation of Duties: See Separation of Duties. Runbooks document operational procedures; they do NOT own agent manifests, execution, or infrastructure configuration.

Vast.ai Integration: See BULLETPROOF_VASTAI_PLAN.md for the complete Vast.ai implementation plan.

This directory contains operational runbooks for all production services in the agent platform.

Agent Services

| Service | Port | Runbook | Description |
| --- | --- | --- | --- |
| Agent BuildKit | N/A (CLI) | agent-buildkit.md | Agent lifecycle management and CLI toolkit |
| Agent Mesh | 3005 | agent-mesh.md | Distributed coordination and gRPC server |
| Agent Brain | 3006 | agent-brain.md | Knowledge graph and vector database |
| Agent Router | 3007 | agent-router.md | LLM routing and load balancing |
| Agent Tracer | 3002 | agent-tracer.md | Observability and tracing (ACE/ATLAS) |
| Agent Studio | 3008 | agent-studio.md | IDE suite and development environment |
| Workflow Engine | 3004 | workflow-engine.md | Workflow orchestration and Langflow bridge |

CI/CD Components

| Service | Port | Runbook | Description |
| --- | --- | --- | --- |
| GitLab Components | N/A | gitlab-components.md | CI/CD component library (60+ components) |

Data Services

| Service | Port | Runbook | Description |
| --- | --- | --- | --- |
| PostgreSQL | 5432 | postgresql.md | Relational data storage |
| Redis | 6379 | redis.md | Cache, sessions, pub/sub |
| Qdrant | 6333 | qdrant.md | Vector search and embeddings |

Observability Services

| Service | Port | Runbook | Description |
| --- | --- | --- | --- |
| Phoenix | 6006 | phoenix.md | LLM observability and tracing |

Runbook Structure

Each runbook follows a consistent structure:

  1. Overview - Purpose, port, health endpoint, dependencies
  2. Common Issues - Symptoms, causes, resolution steps
  3. Restart Procedure - Graceful and emergency restart steps
  4. Logs Location - Where to find logs (Kubernetes and local)
  5. Scaling - Horizontal and vertical scaling guidelines
  6. Alerts - Critical and warning alert definitions
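The six-section structure above can be checked mechanically before a runbook is merged. The following is a minimal sketch, assuming POSIX sh and that each runbook is a Markdown file whose section titles appear on heading lines starting with `#`; the function name `missing_sections` is hypothetical and not part of any existing tooling.

```shell
#!/bin/sh
# Sketch: report which of the six required sections a runbook lacks.
# Assumption: section titles appear on Markdown heading lines ("# ...", "## ...").

# missing_sections FILE -> prints each required section title with no matching
# heading in FILE; returns non-zero if at least one section is missing.
missing_sections() {
  status=0
  for section in "Overview" "Common Issues" "Restart Procedure" \
                 "Logs Location" "Scaling" "Alerts"; do
    # Case-insensitive match on heading lines only, so prose mentions do not count.
    if ! grep -qiE "^#+ .*$section" "$1"; then
      echo "$section"
      status=1
    fi
  done
  return "$status"
}
```

For example, `missing_sections redis.md` prints nothing and returns 0 when all six headings are present.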

Quick Reference

Health Checks

    # Agent services
    curl http://localhost:3005/health   # agent-mesh
    curl http://localhost:3006/health   # agent-brain
    curl http://localhost:3007/health   # agent-router
    curl http://localhost:3004/health   # workflow-engine
    curl http://localhost:3002/health   # agent-tracer
    curl http://localhost:3008/health   # agent-studio

    # Data services
    redis-cli -p 6379 ping
    psql -h localhost -U postgres -c "SELECT 1"
    curl http://localhost:6333/health   # qdrant

    # Observability
    curl http://localhost:6006/health   # phoenix

    # CLI tools
    buildkit status                     # agent-buildkit
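The HTTP checks above can be batched so that one command reports every unhealthy service. This is a minimal sketch, assuming POSIX sh, an installed `curl`, and the ports and `/health` paths exactly as listed in this README; the names `HTTP_SERVICES`, `check_http`, and `check_all_http` are hypothetical. Redis and PostgreSQL are left out because they are checked with their own clients.

```shell
#!/bin/sh
# Sketch: probe every HTTP health endpoint listed in this README.
# Assumptions: curl is installed; services listen on localhost.

# name:port pairs for services that expose GET /health (copied from this README)
HTTP_SERVICES="
agent-tracer:3002
workflow-engine:3004
agent-mesh:3005
agent-brain:3006
agent-router:3007
agent-studio:3008
qdrant:6333
phoenix:6006
"

# check_http NAME PORT -> prints an OK or FAIL line; returns non-zero on failure
check_http() {
  # -f: treat HTTP >= 400 as failure, -s: silent, --max-time: do not hang
  if curl -fs --max-time 3 "http://localhost:$2/health" >/dev/null 2>&1; then
    echo "OK   $1 (:$2)"
  else
    echo "FAIL $1 (:$2)"
    return 1
  fi
}

# check_all_http -> runs every check; exit status is the number of failures
check_all_http() {
  failures=0
  for entry in $HTTP_SERVICES; do
    name=${entry%%:*}   # text before the colon
    port=${entry##*:}   # text after the colon
    check_http "$name" "$port" || failures=$((failures + 1))
  done
  return "$failures"
}
```

A typical call is `check_all_http || echo "$? service(s) unhealthy"`. Returning the failure count as the exit status keeps the function usable in scripts that only test for zero versus non-zero.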

Port Map

| Port Range | Service Category |
| --- | --- |
| 3002-3009 | Agent Services |
| 5432 | PostgreSQL |
| 6333 | Qdrant Vector DB |
| 6379 | Redis |
| 6006 | Phoenix (LLM observability) |
| 7687 | Neo4j |
| 9090 | Prometheus |
| 14268 | Jaeger |
| 16686 | Jaeger UI |
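When an alert or a socket listing gives only a port number, the map above can be turned into a lookup. The sketch below assumes POSIX sh; the function name `service_for_port` is hypothetical, and ports 3003 and 3009 are reported only by category because this README reserves the range without naming a service for them.

```shell
#!/bin/sh
# Sketch: resolve a port number to the service named in this README.

# service_for_port PORT -> prints the owning service; returns 1 if the port is unknown
service_for_port() {
  case "$1" in
    3002) echo "agent-tracer" ;;
    3004) echo "workflow-engine" ;;
    3005) echo "agent-mesh" ;;
    3006) echo "agent-brain" ;;
    3007) echo "agent-router" ;;
    3008) echo "agent-studio" ;;
    3003|3009) echo "Agent Services (no service listed for this port)" ;;
    5432) echo "PostgreSQL" ;;
    6006) echo "Phoenix" ;;
    6333) echo "Qdrant" ;;
    6379) echo "Redis" ;;
    7687) echo "Neo4j" ;;
    9090) echo "Prometheus" ;;
    14268) echo "Jaeger" ;;
    16686) echo "Jaeger UI" ;;
    *) echo "unknown"; return 1 ;;
  esac
}
```

For example, `service_for_port 3005` prints `agent-mesh`.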

Emergency Contacts

  • On-call: PagerDuty rotation
  • Slack: #platform-incidents
  • Escalation: See individual runbooks

Service Dependencies

agent-buildkit
  -> agent-mesh (coordination)
  -> agent-tracer (observability)
  -> GitLab API (CI/CD)

agent-mesh
  -> Redis (pub/sub)
  -> PostgreSQL (registry)
  -> Qdrant (capability matching)

agent-brain
  -> agent-router (LLM APIs)
  -> Qdrant (semantic memory)
  -> Redis (caching)
  -> Phoenix (tracing)

agent-router
  -> Redis (rate limiting)
  -> Phoenix (tracing)
  -> LLM APIs (Anthropic, OpenAI, Ollama)

agent-tracer
  -> Phoenix (LLM traces)
  -> Jaeger (distributed traces)
  -> Neo4j (correlation)
  -> Qdrant (embeddings)

agent-studio
  -> agent-mesh (backend)
  -> agent-brain (knowledge)
  -> agent-tracer (observability)
  -> Ollama (local models)

workflow-engine
  -> Redis (task queue)
  -> PostgreSQL (workflow storage)
  -> agent-mesh (coordination)
  -> Langflow (visual builder)

gitlab-components
  -> GitLab CI/CD
  -> semantic-release
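
The dependency list above implies a start order (dependencies first) and an impact set for each outage. The following is a minimal sketch, assuming POSIX sh with the standard `tsort` and `awk` utilities. The edge list is transcribed from this section, omitting hosted LLM APIs, GitLab, and semantic-release because they are external; the function names `dependency_edges`, `start_order`, and `dependents_of` are hypothetical.

```shell
#!/bin/sh
# Sketch: derive a dependency-respecting start order with tsort.
# Each line is "DEPENDENCY DEPENDENT"; tsort prints the left item before the right.

dependency_edges() {
  cat <<'EOF'
agent-mesh agent-buildkit
agent-tracer agent-buildkit
redis agent-mesh
postgresql agent-mesh
qdrant agent-mesh
agent-router agent-brain
qdrant agent-brain
redis agent-brain
phoenix agent-brain
redis agent-router
phoenix agent-router
ollama agent-router
phoenix agent-tracer
jaeger agent-tracer
neo4j agent-tracer
qdrant agent-tracer
agent-mesh agent-studio
agent-brain agent-studio
agent-tracer agent-studio
ollama agent-studio
redis workflow-engine
postgresql workflow-engine
agent-mesh workflow-engine
langflow workflow-engine
EOF
}

# start_order -> one service per line, every dependency before its dependents
start_order() {
  dependency_edges | tsort
}

# dependents_of SERVICE -> services that depend directly on SERVICE
dependents_of() {
  dependency_edges | awk -v dep="$1" '$1 == dep { print $2 }'
}
```

`start_order` gives one valid ordering; several orderings can satisfy the same constraints, so the exact output may differ between `tsort` implementations. `dependents_of redis` lists agent-mesh, agent-brain, agent-router, and workflow-engine, which is the set to watch during a Redis outage.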

Contributing

When updating runbooks:

  1. Follow the existing template structure
  2. Include specific commands that can be copy-pasted
  3. Document both graceful and emergency procedures
  4. Keep alert thresholds current with monitoring configuration
  5. Test all commands before documenting
  6. Update this README when adding new runbooks
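
Step 6 is easy to forget, so it can be backed by a small check. This is a minimal sketch, assuming POSIX sh, that this file is named `README.md`, and that runbooks are `.md` files in the same directory; the function name `unlisted_runbooks` is hypothetical.

```shell
#!/bin/sh
# Sketch: find runbook files that the README does not mention.
# Assumptions: the index file is README.md and runbooks sit beside it.

# unlisted_runbooks DIR -> prints each *.md file in DIR (other than README.md)
# whose filename does not appear in DIR/README.md; returns non-zero if any is found.
unlisted_runbooks() {
  status=0
  for path in "$1"/*.md; do
    [ -e "$path" ] || continue             # glob matched nothing
    file=${path##*/}                       # strip the directory part
    [ "$file" = "README.md" ] && continue  # the index does not list itself
    # -F: match the filename literally, not as a regular expression
    if ! grep -qF -- "$file" "$1/README.md"; then
      echo "$file"
      status=1
    fi
  done
  return "$status"
}
```

Running `unlisted_runbooks .` from this directory in CI would fail the pipeline whenever a new runbook is added without a matching table row.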