Agent Issues

Troubleshooting guide for agent communication, orchestration, and execution problems.


Issue: Agent Not Responding

Symptoms

  • Agent health check fails
  • No response to mesh discovery
  • Timeout on agent communication
  • Agent shows as "offline" in dashboard

Cause

  1. Agent process crashed
  2. Network connectivity issues
  3. Resource exhaustion (CPU/memory)
  4. Configuration errors
  5. Dependency service unavailable

Solution

    # Check agent container status
    docker ps -a | grep agent

    # View agent logs
    docker logs agent-<name> --tail 200 -f

    # Verify health endpoint
    curl -v http://localhost:3001/health

    # Check agent mesh connectivity
    buildkit agents discover

    # Restart agent
    docker restart agent-<name>

    # Verify agent manifest
    ossa validate .agents/<name>/manifest.json

    # Check dependencies (-c 3 stops ping after three packets instead of running forever)
    docker exec agent-<name> ping -c 3 postgres
    docker exec agent-<name> ping -c 3 redis

Prevention

  • Implement proper health checks
  • Set up monitoring and alerting
  • Configure automatic restart policies
  • Use circuit breakers for dependencies
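
The circuit breaker bullet above can be sketched as follows. This is a minimal illustration and not part of buildkit or the agent runtime: the class name, option names, and defaults are assumptions chosen for the example. The breaker opens after a run of consecutive failures, rejects calls while open, and lets a trial call through once the reset timeout has passed.

```javascript
// Minimal circuit breaker sketch; class and option names are illustrative.
class CircuitBreaker {
  constructor({ failureThreshold = 3, resetTimeoutMs = 10000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now; // injectable clock, handy for tests
    this.failures = 0;
    this.openedAt = null; // null while the circuit is closed
  }

  get state() {
    if (this.openedAt === null) return 'closed';
    const elapsed = this.now() - this.openedAt;
    return elapsed >= this.resetTimeoutMs ? 'half-open' : 'open';
  }

  // Requests are allowed when closed, and as a trial when half-open.
  allowRequest() {
    return this.state !== 'open';
  }

  recordSuccess() {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure() {
    this.failures += 1;
    // A failed trial call re-opens the circuit immediately.
    if (this.openedAt !== null || this.failures >= this.failureThreshold) {
      this.openedAt = this.now();
    }
  }
}

// Wraps an async dependency call (for example a database query).
async function callThroughBreaker(breaker, fn) {
  if (!breaker.allowRequest()) {
    throw new Error('circuit open: dependency call skipped');
  }
  try {
    const result = await fn();
    breaker.recordSuccess();
    return result;
  } catch (err) {
    breaker.recordFailure();
    throw err;
  }
}
```

Failing fast while the circuit is open keeps a struggling dependency from tying up the agent's own request handling, which is one way an unavailable dependency ends up looking like an unresponsive agent.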

Issue: Agent Mesh Discovery Failures

Symptoms

  • Agents cannot find each other
  • "No agents available" errors
  • Partial mesh connectivity
  • Routing failures

Cause

  1. DNS resolution issues
  2. Firewall blocking agent ports
  3. Mesh coordinator down
  4. Stale agent registration
  5. Network partition

Solution

    # Check mesh status
    buildkit agents mesh-health

    # Discover agents manually
    buildkit agents discover --all

    # Check routing status
    buildkit agents routing-status

    # Verify agent registration
    docker exec mesh-coordinator cat /var/lib/mesh/agents.json

    # Force re-registration
    docker exec agent-<name> /app/scripts/register-mesh.sh

    # Check network connectivity between agents
    docker exec agent-a ping agent-b
    docker exec agent-a nc -zv agent-b 3001

Prevention

  • Implement mesh health monitoring
  • Use heartbeat-based registration
  • Configure TTL for agent registrations
  • Set up redundant mesh coordinators
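
Heartbeat-based registration with a TTL could look roughly like the sketch below. It is an in-memory illustration only; the `AgentRegistry` class, its methods, and the 30-second default are assumptions for the example and do not describe how the mesh coordinator actually stores registrations.

```javascript
// In-memory registry sketch: an agent stays registered only while it keeps
// sending heartbeats. Names and defaults are illustrative.
class AgentRegistry {
  constructor({ ttlMs = 30000, now = Date.now } = {}) {
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, handy for tests
    this.lastSeen = new Map(); // agentId -> timestamp of last heartbeat
  }

  // Called each time an agent reports in; also serves as initial registration.
  heartbeat(agentId) {
    this.lastSeen.set(agentId, this.now());
  }

  // Drops registrations older than the TTL and returns the remaining agent ids.
  liveAgents() {
    const cutoff = this.now() - this.ttlMs;
    for (const [agentId, seenAt] of this.lastSeen) {
      if (seenAt < cutoff) this.lastSeen.delete(agentId);
    }
    return [...this.lastSeen.keys()];
  }
}
```

Because stale entries expire on their own, a crashed agent disappears from discovery after one TTL without anyone having to clean up its registration by hand.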

Issue: Agent Task Execution Failures

Symptoms

  • Tasks assigned but not executed
  • Partial task completion
  • Error responses from agents
  • Task queue growing

Cause

  1. Invalid task payload
  2. Missing required capabilities
  3. Resource constraints
  4. External API failures
  5. Timeout exceeded

Solution

    # Check task queue
    docker exec redis redis-cli LRANGE agent:tasks:pending 0 -1

    # View failed tasks
    docker exec redis redis-cli LRANGE agent:tasks:failed 0 -1

    # Check agent capabilities
    curl http://localhost:3001/capabilities

    # Validate task payload
    buildkit task validate --file task.json

    # Retry failed task
    docker exec agent-<name> /app/scripts/retry-task.sh <task-id>

    # Check external API connectivity
    docker exec agent-<name> curl -v https://api.openai.com/v1/models

Prevention

  • Validate task payloads before queuing
  • Implement task timeout handling
  • Add retry logic with exponential backoff
  • Monitor task success/failure rates
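
One way to add retry logic with exponential backoff without extra dependencies is sketched below. The function names and defaults are assumptions for illustration. The delay doubles per attempt up to a cap, and random jitter spreads retries out so that many agents do not retry in lockstep.

```javascript
// Delay before retry number `attempt` (0-based): exponential growth, capped,
// then scaled by a random factor in [0, 1) ("full jitter").
function backoffDelayMs(attempt, options = {}) {
  const { baseMs = 1000, factor = 2, maxMs = 60000, random = Math.random } = options;
  const ceiling = Math.min(maxMs, baseMs * Math.pow(factor, attempt));
  return Math.floor(random() * ceiling);
}

// Runs an async task, retrying failed attempts with the delays above.
// `sleep` is injectable so tests do not have to wait in real time.
async function retryWithBackoff(task, options = {}) {
  const {
    retries = 5,
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
    ...delayOptions
  } = options;
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await task(attempt);
    } catch (err) {
      if (attempt >= retries) throw err; // retries exhausted, surface the error
      await sleep(backoffDelayMs(attempt, delayOptions));
    }
  }
}
```

A caller might wrap an external request as `retryWithBackoff(() => callExternalApi(payload))`, where `callExternalApi` is a hypothetical function standing in for whatever the task actually invokes.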

Issue: Agent-to-Agent Communication Failures

Symptoms

  • Cross-agent requests timeout
  • "Connection refused" errors
  • Message delivery failures
  • Inconsistent agent state

Cause

  1. Network policy blocking traffic
  2. TLS certificate issues
  3. Message serialization errors
  4. Buffer overflow on receiver
  5. Protocol version mismatch

Solution

    # Test direct connectivity
    docker exec agent-a curl -v http://agent-b:3001/api/ping

    # Check TLS certificates
    docker exec agent-a openssl s_client -connect agent-b:3001

    # Verify message format
    docker exec agent-a cat /var/log/agent/outgoing.log | tail -20

    # Check protocol versions
    docker exec agent-a cat /app/package.json | jq '.dependencies["@mesh/protocol"]'
    docker exec agent-b cat /app/package.json | jq '.dependencies["@mesh/protocol"]'

    # Coordinate message between agents using buildkit
    buildkit agents coordinate --source agent-a --target agent-b --message '{"type": "ping"}'

Prevention

  • Use consistent protocol versions
  • Implement message validation
  • Set up tracing for inter-agent calls
  • Monitor latency between agents
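
Message validation on the receiving side might be sketched as below. The envelope shape is an assumption: the `type` field appears in the ping example in this guide, but the `protocolVersion` field and the supported major version are hypothetical and are not taken from the `@mesh/protocol` package.

```javascript
// Hypothetical: the major protocol version this agent understands.
const SUPPORTED_MAJOR_VERSION = 1;

// Parses and checks a raw message string.
// Returns { ok: true, message } or { ok: false, error }.
function validateMessage(raw) {
  let message;
  try {
    message = JSON.parse(raw);
  } catch {
    return { ok: false, error: 'invalid JSON' };
  }
  if (message === null || typeof message !== 'object' || Array.isArray(message)) {
    return { ok: false, error: 'message must be a JSON object' };
  }
  if (typeof message.type !== 'string' || message.type === '') {
    return { ok: false, error: 'missing "type"' };
  }
  if (typeof message.protocolVersion !== 'string') {
    return { ok: false, error: 'missing "protocolVersion"' };
  }
  // Only the major version has to match; minor and patch may differ.
  const major = Number.parseInt(message.protocolVersion.split('.')[0], 10);
  if (major !== SUPPORTED_MAJOR_VERSION) {
    return { ok: false, error: `unsupported protocol version ${message.protocolVersion}` };
  }
  return { ok: true, message };
}
```

Rejecting a message with an explicit reason at the boundary turns serialization errors and version mismatches into a clear log line instead of inconsistent agent state later on.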

Issue: Agent Memory Leaks

Symptoms

  • Agent memory usage growing over time
  • OOMKilled restarts
  • Degrading performance
  • Garbage collection pauses

Cause

  1. Unbounded caches
  2. Event listener leaks
  3. Circular references
  4. Large object retention
  5. Connection pool exhaustion

Solution

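    # Monitor memory usage
    docker stats agent-<name>

    # Check for OOM events
    docker inspect agent-<name> | jq '.[0].State.OOMKilled'

    # Generate heap dump (Node.js)
    # Note: by default SIGUSR1 activates the Node.js inspector and does not write a heap dump.
    # Writing a snapshot on a signal requires starting the agent with
    # node --heapsnapshot-signal=SIGUSR2 (or an equivalent in-app handler).
    docker exec agent-<name> kill -USR2 1
    # Adjust the source path to wherever your agent writes its snapshot
    docker cp agent-<name>:/tmp/heapdump.heapsnapshot .

    # Analyze with Chrome DevTools or clinic.js
    npx clinic heapprofiler -- node app.js

    # Temporary fix: restart the agent to reclaim memory.
    # Running node -e "global.gc()" via docker exec starts a separate Node.js process,
    # so it cannot collect garbage in the running agent (and global.gc is only
    # defined under --expose-gc).
    docker restart agent-<name>

    # Set memory limits
    # In docker-compose.yml:
    deploy:
      resources:
        limits:
          memory: 1G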

Prevention

  • Use WeakMap/WeakSet for caches keyed by objects
  • Implement bounded caches with LRU eviction
  • Clean up event listeners properly
  • Profile memory regularly in development
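
A bounded cache with LRU eviction can be built on a plain `Map`, which iterates in insertion order. The sketch below is a minimal illustration; the `LruCache` name and its interface are assumptions, and a production agent might prefer an established package instead.

```javascript
// Bounded LRU cache sketch. The Map's insertion order doubles as the
// recency order: the first key is always the least recently used one.
class LruCache {
  constructor(maxEntries) {
    if (!Number.isInteger(maxEntries) || maxEntries < 1) {
      throw new RangeError('maxEntries must be a positive integer');
    }
    this.maxEntries = maxEntries;
    this.map = new Map();
  }

  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    // Re-insert to mark the key as most recently used.
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }

  set(key, value) {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.maxEntries) {
      // Evict the least recently used entry.
      const oldestKey = this.map.keys().next().value;
      this.map.delete(oldestKey);
    }
  }

  get size() {
    return this.map.size;
  }
}
```

With a fixed upper bound on entries, the cache's memory footprint stays roughly constant no matter how long the agent runs.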

Issue: Agent Configuration Drift

Symptoms

  • Different behavior across agent instances
  • Unexpected feature flag states
  • Configuration not matching manifest
  • Environment-specific bugs

Cause

  1. Manual configuration changes
  2. Cached configuration not refreshed
  3. Partial deployment
  4. Environment variable mismatches

Solution

    # Export current configuration
    docker exec agent-<name> cat /app/config/config.yaml

    # Compare with expected
    diff <(docker exec agent-<name> cat /app/config/config.yaml) config/agent-config.yaml

    # Reload configuration
    docker exec agent-<name> kill -HUP 1

    # Verify environment variables
    docker exec agent-<name> env | sort

    # Redeploy from manifest
    buildkit agents deploy .agents/<name>/manifest.json --force

    # Sync configuration across agents
    buildkit agents sync-config --all

Prevention

  • Use GitOps for all configuration
  • Implement configuration validation
  • Version all configuration changes
  • Monitor configuration checksums
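
Monitoring configuration checksums could work along these lines: each agent reports a hash of its effective configuration, and a monitor flags any agent whose hash differs from the expected one. The function names below are assumptions for illustration, and the sketch presumes the configuration has already been parsed into a plain object.

```javascript
const { createHash } = require('node:crypto');

// Returns a copy with object keys sorted recursively, so that two configs
// with the same content but different key order hash identically.
function canonicalize(value) {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === 'object') {
    const sortedEntries = Object.keys(value)
      .sort()
      .map((key) => [key, canonicalize(value[key])]);
    return Object.fromEntries(sortedEntries);
  }
  return value;
}

// SHA-256 hex digest of the canonical JSON form of a parsed config object.
function configChecksum(config) {
  const canonicalJson = JSON.stringify(canonicalize(config));
  return createHash('sha256').update(canonicalJson).digest('hex');
}

// Given the expected checksum and a map of agent name -> reported checksum,
// returns the names of agents whose configuration has drifted.
function findDriftedAgents(expectedChecksum, checksumsByAgent) {
  return Object.entries(checksumsByAgent)
    .filter(([, checksum]) => checksum !== expectedChecksum)
    .map(([agentName]) => agentName);
}
```

Comparing short digests is cheaper than diffing full configuration files on every check, and the full `diff` from the Solution steps is then only needed for the agents that were flagged.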

Issue: Agent Rate Limiting

Symptoms

  • 429 Too Many Requests errors
  • Slowdown in agent operations
  • Cascading failures
  • External API quota exceeded

Cause

  1. Excessive API calls
  2. Missing rate limit handling
  3. Retry storms
  4. Burst traffic patterns
  5. Shared quota exhaustion

Solution

    # Check rate limit status
    docker exec agent-<name> cat /var/log/agent/rate-limits.log

    # Verify rate limit configuration
    docker exec agent-<name> cat /app/config/rate-limits.yaml

    # Implement backoff
    # In agent code:
    const retry = require('retry');
    const operation = retry.operation({
      retries: 5,
      factor: 2,
      minTimeout: 1000,
      maxTimeout: 60000,
    });

    # Monitor quota usage
    curl -H "Authorization: Bearer $API_KEY" \
      https://api.openai.com/v1/usage

    # Distribute load across API keys
    buildkit agents balance-quotas

Prevention

  • Implement proper rate limiting
  • Use token bucket algorithms
  • Monitor API quota usage
  • Configure alerts at 80% quota
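
The token bucket algorithm mentioned above can be sketched in a few lines. The class and parameter names are assumptions for illustration. The bucket holds up to `capacity` tokens, refills at a steady rate, and each outgoing request must take a token before it is sent.

```javascript
// Token bucket sketch: allows short bursts up to `capacity` while holding
// the long-run rate to `refillPerSecond`. Names are illustrative.
class TokenBucket {
  constructor({ capacity, refillPerSecond, now = Date.now }) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.now = now; // injectable clock, handy for tests
    this.tokens = capacity; // start full
    this.lastRefill = now();
  }

  // Adds the tokens earned since the last refill, never exceeding capacity.
  refill() {
    const current = this.now();
    const elapsedSeconds = (current - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefill = current;
  }

  // Takes `count` tokens if available; returns false when the caller
  // should wait or drop the request.
  tryRemove(count = 1) {
    this.refill();
    if (this.tokens < count) return false;
    this.tokens -= count;
    return true;
  }
}
```

Limiting requests on the client side like this keeps burst traffic below the provider's limit, so the agent sees fewer 429 responses and is less likely to trigger a retry storm.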

Issue: Agent Authentication Failures

Symptoms

  • "Unauthorized" errors
  • Token refresh failures
  • Session expiration
  • Inter-agent auth failures

Cause

  1. Expired tokens
  2. Invalid credentials
  3. Certificate mismatch
  4. Clock drift
  5. Revoked access

Solution

    # Verify token validity
    docker exec agent-<name> /app/scripts/check-token.sh

    # Refresh authentication
    docker exec agent-<name> /app/scripts/refresh-auth.sh

    # Check clock synchronization (compare container time with host time)
    docker exec agent-<name> date
    date
    # If different:
    # (containers normally share the host clock, so setting it from inside a container
    # needs the SYS_TIME capability; otherwise fix time synchronization on the host)
    docker exec agent-<name> ntpdate -s time.google.com

    # Verify certificates
    docker exec agent-<name> openssl x509 -in /etc/ssl/agent.crt -noout -dates

    # Rotate credentials
    buildkit agents rotate-credentials --agent <name>

    # Check token file permissions
    docker exec agent-<name> ls -la /secrets/

Prevention

  • Implement automatic token refresh
  • Monitor token expiration
  • Use short-lived tokens with refresh
  • Set up clock synchronization
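
Monitoring token expiration might look like the sketch below, assuming the agents use JWTs with a standard `exp` claim; this guide does not say which token format is in use, so treat that as an assumption. The function names are illustrative. The code only reads the expiry time and does not verify the signature, so it is suitable for scheduling a refresh but not for authenticating anything.

```javascript
// Reads the expiry time (in milliseconds since the epoch) from a JWT.
// Does NOT verify the signature.
function jwtExpiresAtMs(token) {
  const parts = token.split('.');
  if (parts.length !== 3) {
    throw new Error('not a JWT: expected three dot-separated parts');
  }
  const payloadJson = Buffer.from(parts[1], 'base64url').toString('utf8');
  const payload = JSON.parse(payloadJson);
  if (typeof payload.exp !== 'number') {
    throw new Error('token has no numeric "exp" claim');
  }
  return payload.exp * 1000; // "exp" is expressed in seconds
}

// True when the token expires within `leewayMs`. The leeway also absorbs
// a small amount of clock drift between agents.
function needsRefresh(token, { nowMs = Date.now(), leewayMs = 60000 } = {}) {
  return jwtExpiresAtMs(token) - nowMs <= leewayMs;
}
```

Refreshing shortly before expiry, instead of reacting to the first "Unauthorized" response, avoids a window in which in-flight requests fail.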

Issue: Agent Scaling Problems

Symptoms

  • Cannot spawn new agent instances
  • Load not distributed evenly
  • Resource contention
  • Inconsistent performance

Cause

  1. Resource quota exceeded
  2. Network address exhaustion
  3. Storage limits
  4. Registry rate limits
  5. Orchestrator constraints

Solution

    # Check current scale
    kubectl get deployments -l app=agent

    # View resource usage
    kubectl top pods -l app=agent

    # Scale manually
    kubectl scale deployment agent-worker --replicas=5

    # Check HPA status
    kubectl get hpa agent-worker

    # View scaling events
    kubectl describe hpa agent-worker

    # Adjust resource requests
    # In deployment.yaml:
    resources:
      requests:
        cpu: 100m
        memory: 256Mi
      limits:
        cpu: 500m
        memory: 512Mi

Prevention

  • Configure appropriate resource requests
  • Set up Horizontal Pod Autoscaler
  • Monitor resource utilization
  • Plan capacity for peak load
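
When reading `kubectl describe hpa` output, it helps to know how the Horizontal Pod Autoscaler arrives at a replica count. The Kubernetes documentation gives the core formula as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), with no change when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below reproduces only that arithmetic; the function name is an assumption, and the real controller also accounts for pod readiness, missing metrics, and stabilization windows.

```javascript
// Simplified version of the HPA replica calculation, for capacity planning.
function desiredReplicas({
  currentReplicas,
  currentMetric,
  targetMetric,
  minReplicas = 1,
  maxReplicas = Infinity,
  tolerance = 0.1,
}) {
  const ratio = currentMetric / targetMetric;
  // Within tolerance of the target: leave the replica count unchanged.
  if (Math.abs(ratio - 1) <= tolerance) return currentReplicas;
  const desired = Math.ceil(currentReplicas * ratio);
  // Clamp to the configured bounds.
  return Math.min(maxReplicas, Math.max(minReplicas, desired));
}
```

For example, 4 replicas at 90% average CPU utilization against a 60% target give ceil(4 × 1.5) = 6 replicas. CPU utilization is measured relative to each container's CPU request, so the `requests` values in deployment.yaml directly affect when scaling triggers.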

Issue: Agent State Synchronization

Symptoms

  • Inconsistent state across replicas
  • Lost updates
  • Race conditions
  • Stale data

Cause

  1. No distributed locking
  2. Cache invalidation issues
  3. Event ordering problems
  4. Network partitions
  5. Eventual consistency delays

Solution

    # Check Redis cluster status
    docker exec redis redis-cli CLUSTER INFO

    # Verify state consistency
    docker exec agent-a curl localhost:3001/state | jq .
    docker exec agent-b curl localhost:3001/state | jq .

    # Force state sync
    docker exec agent-<name> /app/scripts/sync-state.sh

    # Implement distributed locking
    # In agent code (pseudocode; redis.lock stands for whatever lock helper your client library provides):
    const lock = await redis.lock('resource:123', { ttl: 5000 });
    try {
      // Critical section
    } finally {
      await lock.unlock();
    }

    # Clear stale state
    # Warning: FLUSHDB deletes every key in the selected database, including queued tasks
    docker exec redis redis-cli FLUSHDB
    docker restart agent-<name>

Prevention

  • Use Redis for distributed state
  • Implement proper locking mechanisms
  • Design for eventual consistency
  • Add state versioning
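
State versioning is commonly paired with optimistic concurrency: every write states which version it was based on, and a write based on a stale version is rejected instead of silently overwriting a newer value. The in-memory sketch below illustrates the idea; the `VersionedStore` class is an assumption for the example. With Redis as the backing store, the same check-and-write would need to be made atomic, for instance with WATCH/MULTI/EXEC or a Lua script.

```javascript
// Optimistic concurrency sketch: each key carries a version counter.
class VersionedStore {
  constructor() {
    this.entries = new Map(); // key -> { value, version }
  }

  // Returns the current value and version; version 0 means "never written".
  read(key) {
    const entry = this.entries.get(key);
    return entry
      ? { value: entry.value, version: entry.version }
      : { value: undefined, version: 0 };
  }

  // Writes only if `expectedVersion` matches the stored version.
  // On conflict, returns ok: false plus the version the caller should re-read.
  write(key, value, expectedVersion) {
    const currentVersion = this.read(key).version;
    if (expectedVersion !== currentVersion) {
      return { ok: false, version: currentVersion };
    }
    const version = currentVersion + 1;
    this.entries.set(key, { value, version });
    return { ok: true, version };
  }
}
```

A replica whose write is rejected re-reads the key, reapplies its change to the fresh value, and tries again, which prevents the lost updates listed under Symptoms.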

