Agent Mesh Runbook


Distributed agent-to-agent communication fabric providing service discovery, message routing, health monitoring, and coordination for multi-agent systems.

For ecosystem patterns: See Agent Mesh Architecture

Separation of Duties: See Separation of Duties

Purpose: Enables agents to find, communicate with, and collaborate with each other across the platform.
Ports: 3005 (API server), 3015 (gRPC)
Health endpoint: GET /health or GET /api/v1/health
Namespace: mesh (Kubernetes)
Technology: Node.js/TypeScript with gRPC
Package: @bluefly/agent-mesh
CRITICAL: Agents MUST work without a home computer. The service runs on always-on infrastructure (Vast.ai or a dedicated server) and is reachable via Tailscale MagicDNS at agent-mesh.tailcf98b3.ts.net:3005
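
The commands in this runbook use both the Tailscale MagicDNS host and localhost. A small helper can keep the base URL in one place so the same command works in either setting. This is a minimal sketch: `mesh_url` and the `MESH_BASE_URL` variable are hypothetical names chosen for illustration, not something agent-mesh itself reads.

```shell
# Hypothetical helper: build a mesh API URL from one configurable base.
# MESH_BASE_URL is an assumed variable name; the default is the MagicDNS
# address given above.
mesh_url() {
  local base="${MESH_BASE_URL:-http://agent-mesh.tailcf98b3.ts.net:3005}"
  local path="${1:-/health}"
  # Drop one trailing slash from the base and make sure the path has a leading one.
  base="${base%/}"
  case "$path" in
    /*) ;;
    *) path="/$path" ;;
  esac
  printf '%s%s\n' "$base" "$path"
}
```

For example, `curl "$(mesh_url /api/v1/agents)"` targets the always-on host, while `MESH_BASE_URL=http://localhost:3005` points the same command at a local or port-forwarded instance.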

Dependencies

Redis (port 6379) - Service registry and pub/sub
PostgreSQL (port 5432) - Agent metadata and state
Agent Tracer (port 3002) - Distributed tracing
Consul/etcd (optional) - Service discovery backend
Prometheus (port 9090) - Metrics collection

Core Components

Component	Port	Description
Mesh API	3005	REST API server
gRPC Server	3015	High-performance agent communication
Discovery Service	N/A	Agent registration and discovery
Router	N/A	Message routing between agents
Health Monitor	N/A	Agent health checks
Load Balancer	N/A	Request distribution

Common Issues

Issue 1: Agent Discovery Failures

Symptoms:
- Agents cannot find each other
- "Agent not found" errors
- Empty agent listings
Cause:
- Redis service registry unavailable
- Agent registration expired
- Network partitioning

Resolution:

# Check mesh health (via Tailscale MagicDNS)
curl http://agent-mesh.tailcf98b3.ts.net:3005/health

# List registered agents
curl http://agent-mesh.tailcf98b3.ts.net:3005/api/v1/agents

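# Check Redis connectivity
redis-cli ping
# List registered agent keys (--scan iterates incrementally; KEYS can block a busy Redis)
redis-cli --scan --pattern "mesh:agent:*"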

# Force re-registration of agents
curl -X POST http://localhost:3005/api/v1/agents/reregister

# Check discovery service status
curl http://localhost:3005/api/v1/discovery/status

# Restart discovery service
kubectl rollout restart deployment/agent-mesh -n mesh

Issue 2: Message Routing Failures

Symptoms:
- Messages not delivered between agents
- "Routing failed" errors
- High message latency
Cause:
- Target agent unhealthy
- Routing table stale
- Queue overflow

Resolution:

# Check routing table
curl http://localhost:3005/api/v1/routing/table

# View message queue stats
curl http://localhost:3005/api/v1/queues/stats

# Clear routing cache
curl -X POST http://localhost:3005/api/v1/routing/cache/clear

# Rebuild routing table
curl -X POST http://localhost:3005/api/v1/routing/rebuild

# Check dead letter queue
curl http://localhost:3005/api/v1/queues/dlq

# Retry failed messages
curl -X POST http://localhost:3005/api/v1/queues/dlq/retry

# View routing metrics
curl http://localhost:3005/api/v1/metrics/routing

Issue 3: Mesh Health Check Failures

Symptoms:
- Agents marked unhealthy incorrectly
- Frequent agent flapping
- Health checks timing out
Cause:
- Aggressive health check settings
- Network latency spikes
- Agent overloaded

Resolution:

# View current health status
curl http://localhost:3005/api/v1/health/all

# Check health check configuration
curl http://localhost:3005/api/v1/config/health-checks

# Update health check intervals
curl -X PUT http://localhost:3005/api/v1/config \
  -H "Content-Type: application/json" \
  -d '{"health_check_interval_ms": 30000, "health_check_timeout_ms": 10000}'

# View agent health history (set AGENT_ID to the target agent's ID first;
# curl would otherwise treat the braces in {agent_id} as URL globbing)
curl "http://localhost:3005/api/v1/agents/${AGENT_ID}/health/history"

# Manually mark agent healthy
curl -X PUT "http://localhost:3005/api/v1/agents/${AGENT_ID}/health" \
  -H "Content-Type: application/json" \
  -d '{"status": "healthy"}'

# Reset health state
curl -X POST http://localhost:3005/api/v1/health/reset
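
Since aggressive settings are one listed cause of flapping, it can help to sanity-check the interval and timeout before sending them to `/api/v1/config`. The sketch below assumes that the timeout should be shorter than the interval. That rule is a common convention, not a documented agent-mesh constraint, and `validate_health_config` is a hypothetical helper.

```shell
# Hypothetical pre-flight check for health check settings.
# Assumption: a check should time out before the next one is due.
validate_health_config() {
  local interval_ms="$1" timeout_ms="$2" v
  # Reject empty or non-numeric input before doing integer comparisons.
  for v in "$interval_ms" "$timeout_ms"; do
    case "$v" in
      ''|*[!0-9]*) echo "invalid: values must be whole numbers of milliseconds"; return 1 ;;
    esac
  done
  if [ "$interval_ms" -le 0 ] || [ "$timeout_ms" -le 0 ]; then
    echo "invalid: values must be greater than zero"
    return 1
  fi
  if [ "$timeout_ms" -ge "$interval_ms" ]; then
    echo "invalid: timeout (${timeout_ms} ms) must be shorter than interval (${interval_ms} ms)"
    return 1
  fi
  echo "ok"
}
```

With the values used in the update command above, `validate_health_config 30000 10000` prints `ok`.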

Issue 4: gRPC Connection Issues

Symptoms:
- gRPC calls failing
- "Connection refused" on port 3015
- Streaming connections dropping
Cause:
- gRPC server not running
- TLS configuration issues
- Connection pool exhausted

Resolution:

# Check gRPC health
grpcurl -plaintext localhost:3015 grpc.health.v1.Health/Check

# List gRPC services
grpcurl -plaintext localhost:3015 list

# Check connection pool
curl http://localhost:3005/api/v1/grpc/connections

# Reset connection pool
curl -X POST http://localhost:3005/api/v1/grpc/connections/reset

# Verify TLS certificates
openssl s_client -connect localhost:3015 -showcerts

# Restart gRPC server
kubectl rollout restart deployment/agent-mesh-grpc -n mesh
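
For scripted checks, the JSON printed by the `grpcurl` health call can be reduced to an exit code. This is a minimal sketch with a hypothetical `grpc_is_serving` helper. It assumes the server implements the standard `grpc.health.v1` service, whose response carries a `status` field.

```shell
# Hypothetical helper: read grpcurl's JSON response on stdin and succeed
# only when the reported status is exactly SERVING (NOT_SERVING does not match
# because the pattern requires the opening quote directly before SERVING).
grpc_is_serving() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"SERVING"'
}
```

A possible use is `grpcurl -plaintext localhost:3015 grpc.health.v1.Health/Check | grpc_is_serving || echo "gRPC not serving"`.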

Issue 5: High Memory/CPU Usage

Symptoms:
- Mesh pods OOMKilled
- Slow response times
- CPU >90% continuously
Cause:
- Too many connected agents
- Message queue buildup
- Memory leak in routing

Resolution:

# Check resource usage
kubectl top pods -n mesh

# View connected agents count
curl http://localhost:3005/api/v1/agents/count

# Check queue depths
curl http://localhost:3005/api/v1/queues/depth

# Purge old messages (URL quoted so the shell does not interpret "?")
curl -X DELETE "http://localhost:3005/api/v1/queues/purge?older_than=1h"

# Increase resources
kubectl set resources deployment/agent-mesh -n mesh \
  --limits=cpu=2000m,memory=4Gi \
  --requests=cpu=500m,memory=1Gi

# Trigger a garbage collection run
curl -X POST http://localhost:3005/api/v1/gc/run

# Restart to clear memory
kubectl rollout restart deployment/agent-mesh -n mesh

Issue 6: Network Partition Recovery

Symptoms:
- Split-brain scenarios
- Inconsistent agent views
- Duplicate messages
Cause:
- Network partition between mesh nodes
- Redis cluster split
- DNS resolution issues

Resolution:

# Check mesh cluster status
curl http://localhost:3005/api/v1/cluster/status

# View mesh node connectivity
curl http://localhost:3005/api/v1/cluster/nodes

# Force cluster reconciliation
curl -X POST http://localhost:3005/api/v1/cluster/reconcile

# Check Redis cluster health
redis-cli cluster info

# Elect new leader if needed
curl -X POST http://localhost:3005/api/v1/cluster/leader/elect

# Purge duplicate messages
curl -X POST http://localhost:3005/api/v1/messages/dedupe

# Full mesh resync
curl -X POST http://localhost:3005/api/v1/cluster/resync
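
To confirm whether two mesh nodes really hold inconsistent agent views, their agent lists can be compared directly. The sketch below assumes each view has already been saved as a text file with one agent ID per line. How those IDs are extracted from `/api/v1/agents` depends on the response shape, which this runbook does not specify. `diff_agent_views` is a hypothetical helper.

```shell
# Hypothetical helper: compare two agent-ID lists (one ID per line) and
# print the IDs that appear in only one of them.
diff_agent_views() {
  local view_a="$1" view_b="$2" tmp_a tmp_b status=0
  tmp_a="$(mktemp)" || return 2
  tmp_b="$(mktemp)" || { rm -f "$tmp_a"; return 2; }
  # comm requires sorted input; sort -u also drops duplicate IDs.
  sort -u "$view_a" > "$tmp_a"
  sort -u "$view_b" > "$tmp_b"
  # -23 keeps lines unique to the first file, -13 lines unique to the second.
  comm -23 "$tmp_a" "$tmp_b" | sed 's/^/only-in-a: /'
  comm -13 "$tmp_a" "$tmp_b" | sed 's/^/only-in-b: /'
  # A non-zero exit status signals that the views disagree.
  cmp -s "$tmp_a" "$tmp_b" || status=1
  rm -f "$tmp_a" "$tmp_b"
  return "$status"
}
```

An empty output with exit status 0 means both nodes list the same agents; any `only-in-*` line points at a registration that one side is missing.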

Restart Procedure

Graceful Restart (Recommended)

# 1. Drain connections
curl -X POST http://localhost:3005/api/v1/drain

# 2. Wait for active requests to complete
#    (the substitution is quoted so an empty response is not a syntax error)
while [ "$(curl -s http://localhost:3005/api/v1/connections/active | jq -r '.count')" -gt 0 ]; do
  sleep 5
done

# 3. Rolling restart
kubectl rollout restart deployment/agent-mesh -n mesh

# 4. Monitor rollout
kubectl rollout status deployment/agent-mesh -n mesh

# 5. Verify mesh health
curl http://localhost:3005/health
curl http://localhost:3005/api/v1/agents
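
The wait loop in step 2 polls indefinitely if the active count never reaches zero. A bounded variant might look like the sketch below. `wait_for_zero` is a hypothetical helper, and the poll limit and delay are illustrative values to tune, not recommendations from the mesh maintainers.

```shell
# Hypothetical helper: poll a command that prints a connection count until
# it prints 0, giving up after max_attempts polls.
# Usage: wait_for_zero <max_attempts> <delay_seconds> <command> [args...]
wait_for_zero() {
  local max_attempts="$1" delay_s="$2" attempt=1 count
  shift 2
  while :; do
    # Run the caller-supplied command; failures count as "unknown".
    count="$("$@" 2>/dev/null)" || count=""
    case "$count" in
      ''|*[!0-9]*) ;;  # empty or non-numeric output: keep polling
      *) if [ "$count" -eq 0 ]; then return 0; fi ;;
    esac
    if [ "$attempt" -ge "$max_attempts" ]; then break; fi
    attempt=$((attempt + 1))
    sleep "$delay_s"
  done
  echo "gave up after ${max_attempts} polls; last count: ${count:-unknown}" >&2
  return 1
}

# Example wiring (not executed here): wrap the step 2 request in a function.
#   active_count() { curl -s http://localhost:3005/api/v1/connections/active | jq -r '.count'; }
#   wait_for_zero 60 5 active_count && kubectl rollout restart deployment/agent-mesh -n mesh
```

Returning a non-zero status on timeout lets the caller decide whether to proceed with the restart anyway or to escalate.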

Emergency Restart

# Force kill all pods
kubectl delete pods -n mesh -l app=agent-mesh --force

# Wait for recovery
kubectl wait --for=condition=ready pod -l app=agent-mesh -n mesh --timeout=120s

# Rebuild routing table
curl -X POST http://localhost:3005/api/v1/routing/rebuild

# Re-register all agents
curl -X POST http://localhost:3005/api/v1/agents/reregister

Local Development Restart

# Stop any running processes
pkill -f "agent-mesh" || true

# Start in development mode
npm run dev

# Start with debug logging
DEBUG=mesh:* npm run dev

# Start specific components
npm run start:api
npm run start:grpc
npm run start:discovery

Docker Compose Restart

# Graceful restart
docker compose restart agent-mesh

# Force restart with rebuild
docker compose down agent-mesh
docker compose up -d --build agent-mesh

# View logs
docker compose logs -f agent-mesh

Logs Location

Kubernetes Logs

# Real-time logs
kubectl logs -f deployment/agent-mesh -n mesh

# Filter by level
kubectl logs deployment/agent-mesh -n mesh | grep -E "ERROR|WARN"

# All mesh pods
kubectl logs -l app=agent-mesh -n mesh --all-containers

# Export for analysis
kubectl logs deployment/agent-mesh -n mesh > mesh-logs-$(date +%Y%m%d).txt
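
When triaging, a quick count per level is often more useful than the raw filtered lines. This is a minimal sketch, assuming log levels appear as plain uppercase `ERROR` and `WARN` tokens, which is the same assumption the grep filter above makes. `summarize_levels` is a hypothetical helper.

```shell
# Hypothetical helper: count ERROR and WARN lines from a log stream on stdin.
# A line containing both tokens is counted once under each level.
summarize_levels() {
  awk '
    /ERROR/ { errors++ }
    /WARN/  { warns++ }
    END     { printf "ERROR=%d WARN=%d\n", errors, warns }
  '
}
```

For example, `kubectl logs deployment/agent-mesh -n mesh --since=1h | summarize_levels` summarizes the last hour of logs.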

Local Logs

# Application logs
tail -f logs/mesh.log

# Discovery logs
tail -f logs/discovery.log

# Routing logs
tail -f logs/routing.log

# gRPC logs
tail -f logs/grpc.log

Message Logs

# View recent messages (URL quoted so the shell does not interpret "?")
curl "http://localhost:3005/api/v1/messages?limit=100"

# View failed messages
curl http://localhost:3005/api/v1/messages/failed

# Export message logs
curl http://localhost:3005/api/v1/messages/export > messages.json

Scaling

Horizontal Scaling

# Scale mesh replicas
kubectl scale deployment/agent-mesh --replicas=5 -n mesh

# Enable HPA
kubectl autoscale deployment/agent-mesh -n mesh \
  --min=3 --max=10 --cpu-percent=70

# Scale gRPC servers
kubectl scale deployment/agent-mesh-grpc --replicas=3 -n mesh

Vertical Scaling

# Increase resources
kubectl set resources deployment/agent-mesh -n mesh \
  --limits=cpu=4000m,memory=8Gi \
  --requests=cpu=1000m,memory=2Gi

Scaling Guidelines

Metric	Threshold	Action
Connected Agents	> 100/pod	Add replica
Message Throughput	> 1000/s	Scale horizontally
Memory Usage	> 75%	Add memory or replica
gRPC Connections	> 500/pod	Scale gRPC servers
Discovery Latency	> 500ms	Scale discovery service
Queue Depth	> 10000	Scale, increase consumers
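
The thresholds in the table above can be encoded as a simple lookup for use in scripts or chat-ops tooling. In this sketch the metric keys (`agents_per_pod`, `memory_percent`, and so on) are illustrative names, not metrics exported by agent-mesh, and values are expected as whole numbers.

```shell
# Hypothetical helper: map a metric reading to the action from the
# Scaling Guidelines table. Metric keys are illustrative names only.
scaling_action() {
  local metric="$1" value="$2" threshold action
  case "$metric" in
    agents_per_pod)       threshold=100;   action="Add replica" ;;
    messages_per_second)  threshold=1000;  action="Scale horizontally" ;;
    memory_percent)       threshold=75;    action="Add memory or replica" ;;
    grpc_conns_per_pod)   threshold=500;   action="Scale gRPC servers" ;;
    discovery_latency_ms) threshold=500;   action="Scale discovery service" ;;
    queue_depth)          threshold=10000; action="Scale, increase consumers" ;;
    *) echo "unknown metric: $metric" >&2; return 2 ;;
  esac
  # The table uses strict ">" thresholds, so a value equal to the
  # threshold does not trigger an action.
  if [ "$value" -gt "$threshold" ]; then
    echo "$action"
  else
    echo "No action"
  fi
}
```

For instance, `scaling_action agents_per_pod 120` prints `Add replica`, while `scaling_action memory_percent 75` prints `No action`.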

Alerts

Critical Alerts (PagerDuty)

Alert	Condition	Runbook Action
MeshDown	0 healthy pods for 2min	Emergency Restart
DiscoveryFailure	Discovery service down 5min	Restart, check Redis
NetworkPartition	Cluster split detected	Reconcile cluster
MessageQueueOverflow	Queue >100k messages	Scale, purge old messages

Warning Alerts (Slack)

Alert	Condition	Runbook Action
HighLatency	P99 > 1s for 5min	Scale, check network
AgentFlapping	>10 status changes/min	Adjust health checks
MemoryPressure	Memory > 75% for 10min	Scale or restart
gRPCErrors	>5% error rate	Check connections
DLQBacklog	DLQ > 1000 messages	Investigate failures

Prometheus Alert Rules

groups:
  - name: agent-mesh
    rules:
      - alert: AgentMeshDown
        expr: up{job="agent-mesh"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Agent Mesh is down"
          runbook_url: "https://gitlab.com/blueflyio/agent-platform/technical-docs/-/wikis/runbooks/agent-mesh"

      - alert: DiscoveryServiceDown
        expr: mesh_discovery_healthy == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Mesh discovery service is down"

      - alert: HighMessageLatency
        expr: histogram_quantile(0.99, rate(mesh_message_latency_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Mesh message latency high"

      - alert: AgentFlapping
        expr: rate(mesh_agent_status_changes_total[5m]) * 60 > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Agent health status flapping"

      - alert: MessageQueueOverflow
        expr: mesh_queue_depth > 100000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Message queue overflow"

      - alert: NetworkPartition
        expr: mesh_cluster_nodes < mesh_expected_nodes
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Mesh network partition detected"

Monitoring Dashboards

Grafana - Agent Mesh: https://grafana.local/d/agent-mesh
Mesh Topology: http://localhost:3005/dashboard/topology
Agent Registry: http://localhost:3005/dashboard/agents
Message Flow: http://localhost:3005/dashboard/messages

CLI Command Reference

Agent Management

# List agents
buildkit mesh agents list

# Register agent
buildkit mesh agents register --name my-agent --endpoint http://localhost:3001

# Deregister agent
buildkit mesh agents deregister --id agent-123

# Check agent health
buildkit mesh agents health --id agent-123

Discovery Operations

# Discover agents by capability
buildkit mesh discover --capability llm-routing

# Discover agents by namespace
buildkit mesh discover --namespace production

# Refresh discovery cache
buildkit mesh discover --refresh

Message Operations

# Send message to agent
buildkit mesh send --to agent-123 --payload '{"action": "test"}'

# Broadcast to all agents
buildkit mesh broadcast --payload '{"action": "refresh"}'

# View message queue
buildkit mesh queue status

# Retry failed messages
buildkit mesh queue retry --dlq

Cluster Operations

# View cluster status
buildkit mesh cluster status

# List cluster nodes
buildkit mesh cluster nodes

# Force leader election
buildkit mesh cluster elect

# Reconcile cluster
buildkit mesh cluster reconcile

Contacts

On-call: PagerDuty rotation
Slack: #platform-incidents, #agent-mesh
Owner: Platform Team
Repository: https://gitlab.com/blueflyio/llm/npm/agent-mesh

Related Runbooks

Agent Router Runbook - LLM routing
Agent Brain Runbook - State management
Agent Tracer Runbook - Observability
Agent BuildKit Runbook - CLI tools
