Agent Mesh Runbook
Distributed agent-to-agent communication fabric providing service discovery, message routing, health monitoring, and coordination for multi-agent systems.
For ecosystem patterns: See Agent Mesh Architecture
Separation of Duties: See Separation of Duties
- Purpose: Distributed agent-to-agent communication fabric providing service discovery, message routing, health monitoring, and coordination for multi-agent systems. Enables agents to find, communicate, and collaborate across the platform.
- Port: 3005 (API server), 3015 (gRPC)
- Health endpoint: GET /health or GET /api/v1/health
- Namespace: mesh (Kubernetes)
- Technology: Node.js/TypeScript with gRPC
- Package: @bluefly/agent-mesh
- CRITICAL: Agents MUST work without a home computer. The service runs on always-on infrastructure (Vast.ai or a dedicated server) and is accessible via Tailscale MagicDNS: agent-mesh.tailcf98b3.ts.net:3005
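The two health endpoints listed above can be probed from a script before deeper troubleshooting. The following is a minimal sketch, not part of the mesh tooling: the function names `http_get` and `probe_health` and the `MESH_URL` variable are hypothetical, and the fetcher is passed by name so the logic can be exercised without network access.

```shell
# Base URL taken from the MagicDNS address above; override MESH_URL as needed.
MESH_URL="${MESH_URL:-http://agent-mesh.tailcf98b3.ts.net:3005}"

# Default fetcher (hypothetical helper): fails on HTTP 4xx/5xx or after 5 seconds.
http_get() {
  curl --silent --fail --max-time 5 --output /dev/null "$1"
}

# probe_health BASE_URL [FETCHER]
# Prints the first health path that responds; returns 1 if neither does.
probe_health() {
  local base="$1" fetch="${2:-http_get}" path
  for path in /health /api/v1/health; do
    if "$fetch" "${base}${path}"; then
      echo "$path"
      return 0
    fi
  done
  return 1
}

# Example (not run here):
#   probe_health "$MESH_URL" || echo "mesh unreachable" >&2
```

A non-zero exit status means neither endpoint answered, which usually points at connectivity (Tailscale, DNS) or a down service rather than at a single route.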
Dependencies
- Redis (port 6379) - Service registry and pub/sub
- PostgreSQL (port 5432) - Agent metadata and state
- Agent Tracer (port 3002) - Distributed tracing
- Consul/etcd (optional) - Service discovery backend
- Prometheus (port 9090) - Metrics collection
Core Components
| Component | Port | Description |
|---|---|---|
| Mesh API | 3005 | REST API server |
| gRPC Server | 3015 | High-performance agent communication |
| Discovery Service | N/A | Agent registration and discovery |
| Router | N/A | Message routing between agents |
| Health Monitor | N/A | Agent health checks |
| Load Balancer | N/A | Request distribution |
Common Issues
Issue 1: Agent Discovery Failures
- Symptoms:
  - Agents cannot find each other
  - "Agent not found" errors
  - Empty agent listings
- Cause:
  - Redis service registry unavailable
  - Agent registration expired
  - Network partitioning
- Resolution:

      # Check mesh health (via Tailscale MagicDNS)
      curl http://agent-mesh.tailcf98b3.ts.net:3005/health

      # List registered agents
      curl http://agent-mesh.tailcf98b3.ts.net:3005/api/v1/agents

      # Check Redis connectivity
      redis-cli ping

      # List registry keys (--scan avoids blocking Redis the way KEYS can)
      redis-cli --scan --pattern "mesh:agent:*"

      # Force re-registration of agents
      curl -X POST http://localhost:3005/api/v1/agents/reregister

      # Check discovery service status
      curl http://localhost:3005/api/v1/discovery/status

      # Restart discovery service
      kubectl rollout restart deployment/agent-mesh -n mesh
Issue 2: Message Routing Failures
- Symptoms:
  - Messages not delivered between agents
  - "Routing failed" errors
  - High message latency
- Cause:
  - Target agent unhealthy
  - Routing table stale
  - Queue overflow
- Resolution:

      # Check routing table
      curl http://localhost:3005/api/v1/routing/table

      # View message queue stats
      curl http://localhost:3005/api/v1/queues/stats

      # Clear routing cache
      curl -X POST http://localhost:3005/api/v1/routing/cache/clear

      # Rebuild routing table
      curl -X POST http://localhost:3005/api/v1/routing/rebuild

      # Check dead letter queue
      curl http://localhost:3005/api/v1/queues/dlq

      # Retry failed messages
      curl -X POST http://localhost:3005/api/v1/queues/dlq/retry

      # View routing metrics
      curl http://localhost:3005/api/v1/metrics/routing
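Recovery calls such as the routing rebuild or the DLQ retry can fail transiently while the mesh is still recovering. One way to handle that from an operator script is a bounded retry with exponential backoff. This is a generic sketch, not something the mesh provides: `retry_with_backoff` is a hypothetical helper, and the attempt count and delays in the example are arbitrary.

```shell
# retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached,
# doubling the delay after each failed attempt.
retry_with_backoff() {
  local max="$1" delay="$2" attempt=1
  shift 2
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Example (not run here): up to 4 attempts, waiting 2s, 4s, then 8s between them.
#   retry_with_backoff 4 2 curl --silent --fail -X POST \
#     http://localhost:3005/api/v1/queues/dlq/retry
```

The `--fail` flag matters in the example: without it curl exits 0 on HTTP error responses and the wrapper would never retry.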
Issue 3: Mesh Health Check Failures
- Symptoms:
  - Agents marked unhealthy incorrectly
  - Frequent agent flapping
  - Health checks timing out
- Cause:
  - Aggressive health check settings
  - Network latency spikes
  - Agent overloaded
- Resolution:

      # View current health status
      curl http://localhost:3005/api/v1/health/all

      # Check health check configuration
      curl http://localhost:3005/api/v1/config/health-checks

      # Update health check intervals
      curl -X PUT http://localhost:3005/api/v1/config \
        -H "Content-Type: application/json" \
        -d '{"health_check_interval_ms": 30000, "health_check_timeout_ms": 10000}'

      # Set AGENT_ID to the ID of the affected agent before running the next two commands.
      # (A literal {agent_id} in the URL would be treated by curl as a glob pattern.)

      # View agent health history
      curl "http://localhost:3005/api/v1/agents/${AGENT_ID}/health/history"

      # Manually mark agent healthy
      curl -X PUT "http://localhost:3005/api/v1/agents/${AGENT_ID}/health" \
        -H "Content-Type: application/json" \
        -d '{"status": "healthy"}'

      # Reset health state
      curl -X POST http://localhost:3005/api/v1/health/reset
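Before pushing new health check values, it can help to sanity-check the pair. The sketch below encodes two rules of thumb that are assumptions rather than documented mesh behavior: the timeout should be shorter than the interval, and very short intervals tend to cause flapping. The function name `check_health_config` and the 5000 ms default floor are hypothetical.

```shell
# check_health_config INTERVAL_MS TIMEOUT_MS [MIN_INTERVAL_MS]
# Prints "ok" or the reason the values look risky (assumed rules, not mesh policy).
check_health_config() {
  local interval="$1" timeout="$2" min_interval="${3:-5000}"
  if [ "$timeout" -ge "$interval" ]; then
    echo "timeout must be shorter than interval"
    return 1
  fi
  if [ "$interval" -lt "$min_interval" ]; then
    echo "interval under ${min_interval} ms may cause flapping"
    return 1
  fi
  echo "ok"
}

# Example: the values used in the resolution steps above.
#   check_health_config 30000 10000
```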
Issue 4: gRPC Connection Issues
- Symptoms:
  - gRPC calls failing
  - "Connection refused" on port 3015
  - Streaming connections dropping
- Cause:
  - gRPC server not running
  - TLS configuration issues
  - Connection pool exhausted
- Resolution:

      # Check gRPC health
      grpcurl -plaintext localhost:3015 grpc.health.v1.Health/Check

      # List gRPC services
      grpcurl -plaintext localhost:3015 list

      # Check connection pool
      curl http://localhost:3005/api/v1/grpc/connections

      # Reset connection pool
      curl -X POST http://localhost:3005/api/v1/grpc/connections/reset

      # Verify TLS certificates
      openssl s_client -connect localhost:3015 -showcerts

      # Restart gRPC server
      kubectl rollout restart deployment/agent-mesh-grpc -n mesh
Issue 5: High Memory/CPU Usage
- Symptoms:
  - Mesh pods OOMKilled
  - Slow response times
  - CPU >90% continuously
- Cause:
  - Too many connected agents
  - Message queue buildup
  - Memory leak in routing
- Resolution:

      # Check resource usage
      kubectl top pods -n mesh

      # View connected agents count
      curl http://localhost:3005/api/v1/agents/count

      # Check queue depths
      curl http://localhost:3005/api/v1/queues/depth

      # Purge old messages (URL quoted so the shell does not interpret "?")
      curl -X DELETE "http://localhost:3005/api/v1/queues/purge?older_than=1h"

      # Increase resources
      kubectl set resources deployment/agent-mesh -n mesh \
        --limits=cpu=2000m,memory=4Gi \
        --requests=cpu=500m,memory=1Gi

      # Trigger a garbage collection run
      curl -X POST http://localhost:3005/api/v1/gc/run

      # Restart to clear memory
      kubectl rollout restart deployment/agent-mesh -n mesh
Issue 6: Network Partition Recovery
- Symptoms:
  - Split-brain scenarios
  - Inconsistent agent views
  - Duplicate messages
- Cause:
  - Network partition between mesh nodes
  - Redis cluster split
  - DNS resolution issues
- Resolution:

      # Check mesh cluster status
      curl http://localhost:3005/api/v1/cluster/status

      # View mesh node connectivity
      curl http://localhost:3005/api/v1/cluster/nodes

      # Force cluster reconciliation
      curl -X POST http://localhost:3005/api/v1/cluster/reconcile

      # Check Redis cluster health
      redis-cli cluster info

      # Elect new leader if needed
      curl -X POST http://localhost:3005/api/v1/cluster/leader/elect

      # Purge duplicate messages
      curl -X POST http://localhost:3005/api/v1/messages/dedupe

      # Full mesh resync
      curl -X POST http://localhost:3005/api/v1/cluster/resync
Restart Procedure
Graceful Restart (Recommended)
    # 1. Drain connections
    curl -X POST http://localhost:3005/api/v1/drain

    # 2. Wait for active requests to complete
    while [ "$(curl -s http://localhost:3005/api/v1/connections/active | jq -r '.count')" -gt 0 ]; do
      sleep 5
    done

    # 3. Rolling restart
    kubectl rollout restart deployment/agent-mesh -n mesh

    # 4. Monitor rollout
    kubectl rollout status deployment/agent-mesh -n mesh

    # 5. Verify mesh health
    curl http://localhost:3005/health
    curl http://localhost:3005/api/v1/agents
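The wait in step 2 has no upper bound, so a connection that never closes would block the restart indefinitely. A bounded variant is sketched below under stated assumptions: `wait_for_drain` and `active_count` are hypothetical helper names, the response is assumed to carry a numeric `.count` field as in step 2, and the command that reports the count is passed in so the loop can be tested without a running mesh.

```shell
# wait_for_drain TIMEOUT_SECONDS POLL_SECONDS COUNT_COMMAND [ARGS...]
# Returns 0 once COUNT_COMMAND prints 0, 1 on timeout,
# and 2 if the output is not a non-negative integer.
wait_for_drain() {
  local timeout="$1" poll="$2" waited=0 count
  shift 2
  while true; do
    count="$("$@")" || count=""
    case "$count" in
      0) return 0 ;;
      ''|*[!0-9]*)
        echo "could not read active connection count" >&2
        return 2 ;;
    esac
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out with ${count} active connections" >&2
      return 1
    fi
    sleep "$poll"
    waited=$((waited + poll))
  done
}

# Count source matching step 2 of the procedure (requires curl and jq).
active_count() {
  curl -s http://localhost:3005/api/v1/connections/active | jq -r '.count'
}

# Example (not run here): wait up to 5 minutes, polling every 5 seconds.
#   wait_for_drain 300 5 active_count || echo "proceeding without a clean drain" >&2
```

Distinguishing a timeout (status 1) from an unreadable count (status 2) lets the caller decide whether to proceed with the rollout or stop and investigate the API.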
Emergency Restart
    # Force kill all pods (skips graceful termination)
    kubectl delete pods -n mesh -l app=agent-mesh --force --grace-period=0

    # Wait for recovery
    kubectl wait --for=condition=ready pod -l app=agent-mesh -n mesh --timeout=120s

    # Rebuild routing table
    curl -X POST http://localhost:3005/api/v1/routing/rebuild

    # Re-register all agents
    curl -X POST http://localhost:3005/api/v1/agents/reregister
Local Development Restart
    # Stop any running processes
    pkill -f "agent-mesh" || true

    # Start in development mode
    npm run dev

    # Or start with debug logging (pattern quoted so the shell does not expand "*")
    DEBUG='mesh:*' npm run dev

    # Or start specific components
    npm run start:api
    npm run start:grpc
    npm run start:discovery
Docker Compose Restart
    # Graceful restart
    docker compose restart agent-mesh

    # Force restart with rebuild
    docker compose down agent-mesh
    docker compose up -d --build agent-mesh

    # View logs
    docker compose logs -f agent-mesh
Logs Location
Kubernetes Logs
    # Real-time logs
    kubectl logs -f deployment/agent-mesh -n mesh

    # Filter by level
    kubectl logs deployment/agent-mesh -n mesh | grep -E "ERROR|WARN"

    # All mesh pods
    kubectl logs -l app=agent-mesh -n mesh --all-containers

    # Export for analysis
    kubectl logs deployment/agent-mesh -n mesh > mesh-logs-$(date +%Y%m%d).txt
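When triaging, a quick count of error and warning lines is often more useful than the raw filtered output. The helper below is a small sketch that assumes log lines contain the literal strings ERROR and WARN, as the grep filter above does; `count_levels` is a hypothetical name.

```shell
# count_levels: read log lines on stdin and print how many contain ERROR and WARN.
# A line containing both strings is counted once under each level.
count_levels() {
  awk '
    /ERROR/ { errors++ }
    /WARN/  { warns++ }
    END { printf "ERROR=%d WARN=%d\n", errors, warns }
  '
}

# Example (not run here):
#   kubectl logs deployment/agent-mesh -n mesh | count_levels
```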
Local Logs
    # Application logs
    tail -f logs/mesh.log

    # Discovery logs
    tail -f logs/discovery.log

    # Routing logs
    tail -f logs/routing.log

    # gRPC logs
    tail -f logs/grpc.log
Message Logs
    # View recent messages (URL quoted so the shell does not interpret "?")
    curl "http://localhost:3005/api/v1/messages?limit=100"

    # View failed messages
    curl http://localhost:3005/api/v1/messages/failed

    # Export message logs
    curl http://localhost:3005/api/v1/messages/export > messages.json
Scaling
Horizontal Scaling
    # Scale mesh replicas
    kubectl scale deployment/agent-mesh --replicas=5 -n mesh

    # Enable HPA
    kubectl autoscale deployment/agent-mesh -n mesh \
      --min=3 --max=10 --cpu-percent=70

    # Scale gRPC servers
    kubectl scale deployment/agent-mesh-grpc --replicas=3 -n mesh
Vertical Scaling
    # Increase resources
    kubectl set resources deployment/agent-mesh -n mesh \
      --limits=cpu=4000m,memory=8Gi \
      --requests=cpu=1000m,memory=2Gi
Scaling Guidelines
| Metric | Threshold | Action |
|---|---|---|
| Connected Agents | > 100/pod | Add replica |
| Message Throughput | > 1000/s | Scale horizontally |
| Memory Usage | > 75% | Add memory or replica |
| gRPC Connections | > 500/pod | Scale gRPC servers |
| Discovery Latency | > 500ms | Scale discovery service |
| Queue Depth | > 10000 | Scale, increase consumers |
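The guidelines table can be expressed as a small lookup for use in scripts or cron checks. This is an illustrative sketch: the function name `scaling_action` and the metric keys are hypothetical, values are assumed to be whole numbers in the units the table uses (percent for memory, milliseconds for latency), and how the values are collected is out of scope here.

```shell
# scaling_action METRIC VALUE
# Prints the action from the guidelines table when VALUE exceeds the threshold,
# otherwise "none". Returns 2 for an unrecognized metric key.
scaling_action() {
  local metric="$1" value="$2" threshold action
  case "$metric" in
    connected_agents)   threshold=100;   action="Add replica" ;;
    message_throughput) threshold=1000;  action="Scale horizontally" ;;
    memory_usage)       threshold=75;    action="Add memory or replica" ;;
    grpc_connections)   threshold=500;   action="Scale gRPC servers" ;;
    discovery_latency)  threshold=500;   action="Scale discovery service" ;;
    queue_depth)        threshold=10000; action="Scale, increase consumers" ;;
    *) echo "unknown metric: $metric" >&2; return 2 ;;
  esac
  if [ "$value" -gt "$threshold" ]; then
    echo "$action"
  else
    echo "none"
  fi
}

# Example:
#   scaling_action connected_agents 120
```

The comparison is strictly greater-than, matching the ">" thresholds in the table, so a value exactly at the threshold yields "none".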
Alerts
Critical Alerts (PagerDuty)
| Alert | Condition | Runbook Action |
|---|---|---|
| MeshDown | 0 healthy pods for 2min | Emergency Restart |
| DiscoveryFailure | Discovery service down 5min | Restart, check Redis |
| NetworkPartition | Cluster split detected | Reconcile cluster |
| MessageQueueOverflow | Queue >100k messages | Scale, purge old messages |
Warning Alerts (Slack)
| Alert | Condition | Runbook Action |
|---|---|---|
| HighLatency | P99 > 1s for 5min | Scale, check network |
| AgentFlapping | >10 status changes/min | Adjust health checks |
| MemoryPressure | Memory > 75% for 10min | Scale or restart |
| gRPCErrors | >5% error rate | Check connections |
| DLQBacklog | DLQ > 1000 messages | Investigate failures |
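For chat-ops or on-call tooling, the two alert tables can be folded into one lookup that returns the notification channel and the first runbook action. The sketch below copies the alert names and actions from the tables as written; the function name `alert_runbook` and the "Channel: Action" output format are hypothetical.

```shell
# alert_runbook ALERT_NAME
# Prints "<channel>: <runbook action>" for a known alert; returns 1 otherwise.
alert_runbook() {
  case "$1" in
    MeshDown)             echo "PagerDuty: Emergency Restart" ;;
    DiscoveryFailure)     echo "PagerDuty: Restart, check Redis" ;;
    NetworkPartition)     echo "PagerDuty: Reconcile cluster" ;;
    MessageQueueOverflow) echo "PagerDuty: Scale, purge old messages" ;;
    HighLatency)          echo "Slack: Scale, check network" ;;
    AgentFlapping)        echo "Slack: Adjust health checks" ;;
    MemoryPressure)       echo "Slack: Scale or restart" ;;
    gRPCErrors)           echo "Slack: Check connections" ;;
    DLQBacklog)           echo "Slack: Investigate failures" ;;
    *) echo "unknown alert: $1" >&2; return 1 ;;
  esac
}

# Example:
#   alert_runbook AgentFlapping
```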
Prometheus Alert Rules
    groups:
      - name: agent-mesh
        rules:
          - alert: AgentMeshDown
            expr: up{job="agent-mesh"} == 0
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "Agent Mesh is down"
              runbook_url: "https://gitlab.com/blueflyio/agent-platform/technical-docs/-/wikis/runbooks/agent-mesh"

          - alert: DiscoveryServiceDown
            expr: mesh_discovery_healthy == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Mesh discovery service is down"

          - alert: HighMessageLatency
            expr: histogram_quantile(0.99, rate(mesh_message_latency_seconds_bucket[5m])) > 1
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Mesh message latency high"

          - alert: AgentFlapping
            # rate() is per second; multiply by 60 to match the >10 status changes/min threshold
            expr: rate(mesh_agent_status_changes_total[5m]) * 60 > 10
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Agent health status flapping"

          - alert: MessageQueueOverflow
            expr: mesh_queue_depth > 100000
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Message queue overflow"

          - alert: NetworkPartition
            expr: mesh_cluster_nodes < mesh_expected_nodes
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Mesh network partition detected"
Monitoring Dashboards
- Grafana - Agent Mesh: https://grafana.local/d/agent-mesh
- Mesh Topology: http://localhost:3005/dashboard/topology
- Agent Registry: http://localhost:3005/dashboard/agents
- Message Flow: http://localhost:3005/dashboard/messages
CLI Command Reference
Agent Management
    # List agents
    buildkit mesh agents list

    # Register agent
    buildkit mesh agents register --name my-agent --endpoint http://localhost:3001

    # Deregister agent
    buildkit mesh agents deregister --id agent-123

    # Check agent health
    buildkit mesh agents health --id agent-123
Discovery Operations
    # Discover agents by capability
    buildkit mesh discover --capability llm-routing

    # Discover agents by namespace
    buildkit mesh discover --namespace production

    # Refresh discovery cache
    buildkit mesh discover --refresh
Message Operations
    # Send message to agent
    buildkit mesh send --to agent-123 --payload '{"action": "test"}'

    # Broadcast to all agents
    buildkit mesh broadcast --payload '{"action": "refresh"}'

    # View message queue
    buildkit mesh queue status

    # Retry failed messages
    buildkit mesh queue retry --dlq
Cluster Operations
    # View cluster status
    buildkit mesh cluster status

    # List cluster nodes
    buildkit mesh cluster nodes

    # Force leader election
    buildkit mesh cluster elect

    # Reconcile cluster
    buildkit mesh cluster reconcile
Contacts
- On-call: PagerDuty rotation
- Slack: #platform-incidents, #agent-mesh
- Owner: Platform Team
- Repository: https://gitlab.com/blueflyio/llm/npm/agent-mesh
Related Runbooks
- Agent Router Runbook - LLM routing
- Agent Brain Runbook - State management
- Agent Tracer Runbook - Observability
- Agent BuildKit Runbook - CLI tools