Agent Mesh Runbook

Distributed agent-to-agent communication fabric providing service discovery, message routing, health monitoring, and coordination for multi-agent systems.

For ecosystem patterns: See Agent Mesh Architecture

Separation of Duties: See Separation of Duties

  • Purpose: Enables agents to find, communicate with, and collaborate with each other across the platform.
  • Ports: 3005 (API server), 3015 (gRPC)
  • Health endpoint: GET /health or GET /api/v1/health
  • Namespace: mesh (Kubernetes)
  • Technology: Node.js/TypeScript with gRPC
  • Package: @bluefly/agent-mesh
  • CRITICAL: Agents MUST work without a home computer being online. The service runs on always-on infrastructure (Vast.ai or a dedicated server) and is reachable via Tailscale MagicDNS at agent-mesh.tailcf98b3.ts.net:3005
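
The two health endpoints listed above can be probed in order from any machine on the tailnet. The following is a minimal sketch, assuming either endpoint answers with an HTTP 2xx status when the service is healthy. The function name `mesh_health` and the `MESH_URL` variable are hypothetical and not part of @bluefly/agent-mesh.

```shell
#!/bin/sh
# Hypothetical helper: probe both documented health endpoints in order.
# MESH_URL is an assumed variable; it defaults to the MagicDNS address above.
MESH_URL="${MESH_URL:-http://agent-mesh.tailcf98b3.ts.net:3005}"

mesh_health() {
  for endpoint in /health /api/v1/health; do
    # -f: treat HTTP errors as failures, -sS: quiet output, --max-time: bound the wait
    if curl -fsS --max-time 5 -o /dev/null "${MESH_URL}${endpoint}" 2>/dev/null; then
      echo "healthy via ${endpoint}"
      return 0
    fi
  done
  echo "unhealthy: no health endpoint responded"
  return 1
}
```

Running `mesh_health || echo "escalate"` gives a single exit status that a cron job or wrapper script can act on.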

Dependencies

  • Redis (port 6379) - Service registry and pub/sub
  • PostgreSQL (port 5432) - Agent metadata and state
  • Agent Tracer (port 3002) - Distributed tracing
  • Consul/etcd (optional) - Service discovery backend
  • Prometheus (port 9090) - Metrics collection

Core Components

| Component | Port | Description |
|---|---|---|
| Mesh API | 3005 | REST API server |
| gRPC Server | 3015 | High-performance agent communication |
| Discovery Service | N/A | Agent registration and discovery |
| Router | N/A | Message routing between agents |
| Health Monitor | N/A | Agent health checks |
| Load Balancer | N/A | Request distribution |

Common Issues

Issue 1: Agent Discovery Failures

  • Symptoms:
    • Agents cannot find each other
    • "Agent not found" errors
    • Empty agent listings
  • Cause:
    • Redis service registry unavailable
    • Agent registration expired
    • Network partitioning
  • Resolution:
    # Check mesh health (via Tailscale MagicDNS)
    curl http://agent-mesh.tailcf98b3.ts.net:3005/health

    # List registered agents
    curl http://agent-mesh.tailcf98b3.ts.net:3005/api/v1/agents

    # Check Redis connectivity
    redis-cli ping
    redis-cli keys "mesh:agent:*"

    # Force re-registration of agents
    curl -X POST http://localhost:3005/api/v1/agents/reregister

    # Check discovery service status
    curl http://localhost:3005/api/v1/discovery/status

    # Restart discovery service
    kubectl rollout restart deployment/agent-mesh -n mesh

Issue 2: Message Routing Failures

  • Symptoms:
    • Messages not delivered between agents
    • "Routing failed" errors
    • High message latency
  • Cause:
    • Target agent unhealthy
    • Routing table stale
    • Queue overflow
  • Resolution:
    # Check routing table
    curl http://localhost:3005/api/v1/routing/table

    # View message queue stats
    curl http://localhost:3005/api/v1/queues/stats

    # Clear routing cache
    curl -X POST http://localhost:3005/api/v1/routing/cache/clear

    # Rebuild routing table
    curl -X POST http://localhost:3005/api/v1/routing/rebuild

    # Check dead letter queue
    curl http://localhost:3005/api/v1/queues/dlq

    # Retry failed messages
    curl -X POST http://localhost:3005/api/v1/queues/dlq/retry

    # View routing metrics
    curl http://localhost:3005/api/v1/metrics/routing

Issue 3: Mesh Health Check Failures

  • Symptoms:
    • Agents marked unhealthy incorrectly
    • Frequent agent flapping
    • Health checks timing out
  • Cause:
    • Aggressive health check settings
    • Network latency spikes
    • Agent overloaded
  • Resolution:
    # View current health status
    curl http://localhost:3005/api/v1/health/all

    # Check health check configuration
    curl http://localhost:3005/api/v1/config/health-checks

    # Update health check intervals
    curl -X PUT http://localhost:3005/api/v1/config \
      -H "Content-Type: application/json" \
      -d '{"health_check_interval_ms": 30000, "health_check_timeout_ms": 10000}'

    # View agent health history (replace {agent_id} with the agent's ID)
    curl http://localhost:3005/api/v1/agents/{agent_id}/health/history

    # Manually mark agent healthy (replace {agent_id} with the agent's ID)
    curl -X PUT http://localhost:3005/api/v1/agents/{agent_id}/health \
      -H "Content-Type: application/json" \
      -d '{"status": "healthy"}'

    # Reset health state
    curl -X POST http://localhost:3005/api/v1/health/reset
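
When tuning the health check settings above, it can help to validate the values before sending them. The sketch below builds the JSON payload for the `PUT /api/v1/config` call and rejects a timeout that is not shorter than the interval. That constraint is an assumption based on general health-check practice, not a documented rule of the mesh, and `health_check_payload` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: build the config payload used in the PUT request above.
# Assumption: a timeout at or above the interval lets checks overlap, so reject it.
health_check_payload() {
  interval_ms="$1"
  timeout_ms="$2"
  for value in "$interval_ms" "$timeout_ms"; do
    case "$value" in
      ''|*[!0-9]*)
        echo "error: values must be non-negative integers" >&2
        return 2 ;;
    esac
  done
  if [ "$timeout_ms" -ge "$interval_ms" ]; then
    echo "error: timeout (${timeout_ms} ms) must be shorter than interval (${interval_ms} ms)" >&2
    return 1
  fi
  printf '{"health_check_interval_ms": %s, "health_check_timeout_ms": %s}\n' \
    "$interval_ms" "$timeout_ms"
}
```

The output can be passed straight to curl, for example `-d "$(health_check_payload 30000 10000)"`.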

Issue 4: gRPC Connection Issues

  • Symptoms:
    • gRPC calls failing
    • "Connection refused" on port 3015
    • Streaming connections dropping
  • Cause:
    • gRPC server not running
    • TLS configuration issues
    • Connection pool exhausted
  • Resolution:
    # Check gRPC health
    grpcurl -plaintext localhost:3015 grpc.health.v1.Health/Check

    # List gRPC services
    grpcurl -plaintext localhost:3015 list

    # Check connection pool
    curl http://localhost:3005/api/v1/grpc/connections

    # Reset connection pool
    curl -X POST http://localhost:3005/api/v1/grpc/connections/reset

    # Verify TLS certificates
    openssl s_client -connect localhost:3015 -showcerts

    # Restart gRPC server
    kubectl rollout restart deployment/agent-mesh-grpc -n mesh

Issue 5: High Memory/CPU Usage

  • Symptoms:
    • Mesh pods OOMKilled
    • Slow response times
    • CPU >90% continuously
  • Cause:
    • Too many connected agents
    • Message queue buildup
    • Memory leak in routing
  • Resolution:
    # Check resource usage
    kubectl top pods -n mesh

    # View connected agents count
    curl http://localhost:3005/api/v1/agents/count

    # Check queue depths
    curl http://localhost:3005/api/v1/queues/depth

    # Purge old messages (URL quoted so the shell does not interpret the "?")
    curl -X DELETE "http://localhost:3005/api/v1/queues/purge?older_than=1h"

    # Increase resources
    kubectl set resources deployment/agent-mesh -n mesh \
      --limits=cpu=2000m,memory=4Gi \
      --requests=cpu=500m,memory=1Gi

    # Trigger garbage collection
    curl -X POST http://localhost:3005/api/v1/gc/run

    # Restart to clear memory
    kubectl rollout restart deployment/agent-mesh -n mesh
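
To spot pods approaching their memory limit before they are OOMKilled, the output of `kubectl top pods` can be filtered. This is a rough sketch, assuming the default three-column output (NAME, CPU, MEMORY) with memory reported in Mi. The helper name `pods_over_memory` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl top pods` output on stdin and print
# pods whose memory usage exceeds the given limit in Mi.
# Assumes a header row is present (do not combine with --no-headers).
pods_over_memory() {
  limit_mi="$1"
  awk -v limit="$limit_mi" '
    NR > 1 && $3 ~ /^[0-9]+Mi$/ {
      mem = $3
      sub(/Mi$/, "", mem)              # strip the unit suffix
      if (mem + 0 > limit + 0) print $1, mem "Mi"
    }'
}
```

For example, `kubectl top pods -n mesh | pods_over_memory 3072` lists pods above 75% of the 4Gi limit set above.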

Issue 6: Network Partition Recovery

  • Symptoms:
    • Split-brain scenarios
    • Inconsistent agent views
    • Duplicate messages
  • Cause:
    • Network partition between mesh nodes
    • Redis cluster split
    • DNS resolution issues
  • Resolution:
    # Check mesh cluster status
    curl http://localhost:3005/api/v1/cluster/status

    # View mesh node connectivity
    curl http://localhost:3005/api/v1/cluster/nodes

    # Force cluster reconciliation
    curl -X POST http://localhost:3005/api/v1/cluster/reconcile

    # Check Redis cluster health
    redis-cli cluster info

    # Elect new leader if needed
    curl -X POST http://localhost:3005/api/v1/cluster/leader/elect

    # Purge duplicate messages
    curl -X POST http://localhost:3005/api/v1/messages/dedupe

    # Full mesh resync
    curl -X POST http://localhost:3005/api/v1/cluster/resync
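
The "inconsistent agent views" symptom can be confirmed by comparing the agent lists reported by two mesh nodes. The sketch below assumes the agent IDs from each node have already been saved to text files, one ID per line. How to extract IDs from the `/api/v1/agents` response depends on its shape, which this runbook does not document. `diff_agent_views` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: compare two files of agent IDs (one ID per line)
# and report IDs that appear in only one of them.
diff_agent_views() {
  sorted_a="$(mktemp)"
  sorted_b="$(mktemp)"
  # comm requires sorted input; -u also drops duplicate IDs
  sort -u "$1" > "$sorted_a"
  sort -u "$2" > "$sorted_b"
  comm -23 "$sorted_a" "$sorted_b" | sed 's/^/only in first: /'
  comm -13 "$sorted_a" "$sorted_b" | sed 's/^/only in second: /'
  rm -f "$sorted_a" "$sorted_b"
}
```

Empty output means both nodes report the same set of agents. Any output suggests running the reconcile or resync step above.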

Restart Procedure

    # 1. Drain connections
    curl -X POST http://localhost:3005/api/v1/drain

    # 2. Wait for active requests to complete
    while [ "$(curl -s http://localhost:3005/api/v1/connections/active | jq '.count')" -gt 0 ]; do
      sleep 5
    done

    # 3. Rolling restart
    kubectl rollout restart deployment/agent-mesh -n mesh

    # 4. Monitor rollout
    kubectl rollout status deployment/agent-mesh -n mesh

    # 5. Verify mesh health
    curl http://localhost:3005/health
    curl http://localhost:3005/api/v1/agents
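
The wait loop in step 2 never gives up if connections fail to drain. A bounded variant might look like the sketch below. It reuses the `connections/active` endpoint and its `.count` field from the procedure above, while the function names and the default timeout of 120 seconds are assumptions chosen for illustration.

```shell
#!/bin/sh
# Reads the active connection count from the endpoint used in step 2.
active_connections() {
  curl -s http://localhost:3005/api/v1/connections/active | jq -r '.count'
}

# Hypothetical helper: wait until connections drain or the timeout expires.
# Usage: wait_for_drain [timeout_seconds] [poll_seconds]
wait_for_drain() {
  timeout_s="${1:-120}"
  poll_s="${2:-5}"
  waited=0
  while :; do
    count="$(active_connections)"
    case "$count" in
      ''|*[!0-9]*)
        echo "error: could not read active connection count" >&2
        return 2 ;;
    esac
    if [ "$count" -eq 0 ]; then
      echo "drained after ${waited}s"
      return 0
    fi
    if [ "$waited" -ge "$timeout_s" ]; then
      echo "timeout: ${count} connections still active after ${waited}s" >&2
      return 1
    fi
    sleep "$poll_s"
    waited=$((waited + poll_s))
  done
}
```

Calling `wait_for_drain 300 || exit 1` before the rolling restart stops the procedure instead of hanging when the drain stalls.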

Emergency Restart

    # Force kill all pods
    kubectl delete pods -n mesh -l app=agent-mesh --force

    # Wait for recovery
    kubectl wait --for=condition=ready pod -l app=agent-mesh -n mesh --timeout=120s

    # Rebuild routing table
    curl -X POST http://localhost:3005/api/v1/routing/rebuild

    # Re-register all agents
    curl -X POST http://localhost:3005/api/v1/agents/reregister

Local Development Restart

    # Stop any running processes
    pkill -f "agent-mesh" || true

    # Start in development mode
    npm run dev

    # Or start with debug logging (quoted so the shell does not expand the "*")
    DEBUG='mesh:*' npm run dev

    # Or start specific components
    npm run start:api
    npm run start:grpc
    npm run start:discovery

Docker Compose Restart

    # Graceful restart
    docker compose restart agent-mesh

    # Force restart with rebuild
    docker compose down agent-mesh
    docker compose up -d --build agent-mesh

    # View logs
    docker compose logs -f agent-mesh

Logs Location

Kubernetes Logs

    # Real-time logs
    kubectl logs -f deployment/agent-mesh -n mesh

    # Filter by level
    kubectl logs deployment/agent-mesh -n mesh | grep -E "ERROR|WARN"

    # All mesh pods
    kubectl logs -l app=agent-mesh -n mesh --all-containers

    # Export for analysis
    kubectl logs deployment/agent-mesh -n mesh > mesh-logs-$(date +%Y%m%d).txt
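
For a quick triage, the level filter above can be turned into a count. This sketch assumes log lines contain the literal tokens ERROR and WARN, as the grep pattern above implies. `log_level_summary` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: count ERROR and WARN lines read from stdin.
log_level_summary() {
  awk '
    /ERROR/ { errors++ }
    /WARN/  { warns++ }
    END { printf "ERROR=%d WARN=%d\n", errors, warns }'
}
```

A typical call would be `kubectl logs deployment/agent-mesh -n mesh | log_level_summary`.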

Local Logs

    # Application logs
    tail -f logs/mesh.log

    # Discovery logs
    tail -f logs/discovery.log

    # Routing logs
    tail -f logs/routing.log

    # gRPC logs
    tail -f logs/grpc.log

Message Logs

    # View recent messages (URL quoted so the shell does not interpret the "?")
    curl "http://localhost:3005/api/v1/messages?limit=100"

    # View failed messages
    curl http://localhost:3005/api/v1/messages/failed

    # Export message logs
    curl http://localhost:3005/api/v1/messages/export > messages.json

Scaling

Horizontal Scaling

    # Scale mesh replicas
    kubectl scale deployment/agent-mesh --replicas=5 -n mesh

    # Enable HPA
    kubectl autoscale deployment/agent-mesh -n mesh \
      --min=3 --max=10 --cpu-percent=70

    # Scale gRPC servers
    kubectl scale deployment/agent-mesh-grpc --replicas=3 -n mesh

Vertical Scaling

    # Increase resources
    kubectl set resources deployment/agent-mesh -n mesh \
      --limits=cpu=4000m,memory=8Gi \
      --requests=cpu=1000m,memory=2Gi

Scaling Guidelines

| Metric | Threshold | Action |
|---|---|---|
| Connected Agents | > 100/pod | Add replica |
| Message Throughput | > 1000/s | Scale horizontally |
| Memory Usage | > 75% | Add memory or replica |
| gRPC Connections | > 500/pod | Scale gRPC servers |
| Discovery Latency | > 500ms | Scale discovery service |
| Queue Depth | > 10000 | Scale, increase consumers |
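
The thresholds in the table above can be encoded as a simple check. The sketch below takes six integer metric values and prints the recommended action for each threshold that is exceeded. The function name and argument order are hypothetical, and gathering the metric values (for example from Prometheus) is left out.

```shell
#!/bin/sh
# Hypothetical helper: map current metrics to the actions in the table above.
# Arguments (integers): agents/pod, messages/s, memory %, gRPC conns/pod,
# discovery latency in ms, queue depth.
scaling_actions() {
  agents_per_pod="$1"; msgs_per_s="$2"; mem_pct="$3"
  grpc_per_pod="$4"; discovery_ms="$5"; queue_depth="$6"
  if [ "$agents_per_pod" -gt 100 ]; then echo "Connected Agents: add replica"; fi
  if [ "$msgs_per_s" -gt 1000 ]; then echo "Message Throughput: scale horizontally"; fi
  if [ "$mem_pct" -gt 75 ]; then echo "Memory Usage: add memory or replica"; fi
  if [ "$grpc_per_pod" -gt 500 ]; then echo "gRPC Connections: scale gRPC servers"; fi
  if [ "$discovery_ms" -gt 500 ]; then echo "Discovery Latency: scale discovery service"; fi
  if [ "$queue_depth" -gt 10000 ]; then echo "Queue Depth: scale, increase consumers"; fi
  return 0
}
```

For example, `scaling_actions 120 800 60 510 200 500` reports the Connected Agents and gRPC Connections actions. No output means every metric is within its threshold.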

Alerts

Critical Alerts (PagerDuty)

| Alert | Condition | Runbook Action |
|---|---|---|
| MeshDown | 0 healthy pods for 2min | Emergency Restart |
| DiscoveryFailure | Discovery service down 5min | Restart, check Redis |
| NetworkPartition | Cluster split detected | Reconcile cluster |
| MessageQueueOverflow | Queue >100k messages | Scale, purge old messages |

Warning Alerts (Slack)

| Alert | Condition | Runbook Action |
|---|---|---|
| HighLatency | P99 > 1s for 5min | Scale, check network |
| AgentFlapping | >10 status changes/min | Adjust health checks |
| MemoryPressure | Memory > 75% for 10min | Scale or restart |
| gRPCErrors | >5% error rate | Check connections |
| DLQBacklog | DLQ > 1000 messages | Investigate failures |

Prometheus Alert Rules

    groups:
      - name: agent-mesh
        rules:
          - alert: AgentMeshDown
            expr: up{job="agent-mesh"} == 0
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "Agent Mesh is down"
              runbook_url: "https://gitlab.com/blueflyio/agent-platform/technical-docs/-/wikis/runbooks/agent-mesh"

          - alert: DiscoveryServiceDown
            expr: mesh_discovery_healthy == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Mesh discovery service is down"

          - alert: HighMessageLatency
            expr: histogram_quantile(0.99, rate(mesh_message_latency_seconds_bucket[5m])) > 1
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Mesh message latency high"

          - alert: AgentFlapping
            # rate() is per second; multiply by 60 to match the 10 changes/min threshold
            expr: rate(mesh_agent_status_changes_total[5m]) * 60 > 10
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Agent health status flapping"

          - alert: MessageQueueOverflow
            expr: mesh_queue_depth > 100000
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Message queue overflow"

          - alert: NetworkPartition
            expr: mesh_cluster_nodes < mesh_expected_nodes
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Mesh network partition detected"

Monitoring Dashboards

  • Grafana - Agent Mesh: https://grafana.local/d/agent-mesh
  • Mesh Topology: http://localhost:3005/dashboard/topology
  • Agent Registry: http://localhost:3005/dashboard/agents
  • Message Flow: http://localhost:3005/dashboard/messages

CLI Command Reference

Agent Management

    # List agents
    buildkit mesh agents list

    # Register agent
    buildkit mesh agents register --name my-agent --endpoint http://localhost:3001

    # Deregister agent
    buildkit mesh agents deregister --id agent-123

    # Check agent health
    buildkit mesh agents health --id agent-123

Discovery Operations

    # Discover agents by capability
    buildkit mesh discover --capability llm-routing

    # Discover agents by namespace
    buildkit mesh discover --namespace production

    # Refresh discovery cache
    buildkit mesh discover --refresh

Message Operations

    # Send message to agent
    buildkit mesh send --to agent-123 --payload '{"action": "test"}'

    # Broadcast to all agents
    buildkit mesh broadcast --payload '{"action": "refresh"}'

    # View message queue
    buildkit mesh queue status

    # Retry failed messages
    buildkit mesh queue retry --dlq

Cluster Operations

    # View cluster status
    buildkit mesh cluster status

    # List cluster nodes
    buildkit mesh cluster nodes

    # Force leader election
    buildkit mesh cluster elect

    # Reconcile cluster
    buildkit mesh cluster reconcile

Contacts