Agent Issues
Troubleshooting guide for agent communication, orchestration, and execution problems.
Issue: Agent Not Responding
Symptoms
- Agent health check fails
- No response to mesh discovery
- Timeout on agent communication
- Agent shows as "offline" in dashboard
Cause
- Agent process crashed
- Network connectivity issues
- Resource exhaustion (CPU/memory)
- Configuration errors
- Dependency service unavailable
Solution
    # Check agent container status
    docker ps -a | grep agent

    # View agent logs
    docker logs agent-<name> --tail 200 -f

    # Verify health endpoint
    curl -v http://localhost:3001/health

    # Check agent mesh connectivity
    buildkit agents discover

    # Restart agent
    docker restart agent-<name>

    # Verify agent manifest
    ossa validate .agents/<name>/manifest.json

    # Check dependencies (-c 3 stops ping after three packets instead of running forever)
    docker exec agent-<name> ping -c 3 postgres
    docker exec agent-<name> ping -c 3 redis
Prevention
- Implement proper health checks
- Set up monitoring and alerting
- Configure automatic restart policies
- Use circuit breakers for dependencies
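The circuit breaker recommendation above can be sketched as a small state machine. This is a minimal illustration, not the agents' actual implementation: the class name, option names, and default thresholds are all assumptions.

```javascript
// Minimal circuit breaker state machine (all names and defaults are hypothetical).
class CircuitBreaker {
  constructor({ failureThreshold = 3, resetTimeoutMs = 10000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now; // injectable clock, useful for tests
    this.failures = 0;
    this.openedAt = null; // timestamp when the breaker opened, or null while closed
  }

  // True if a call to the dependency should be attempted right now.
  canRequest() {
    if (this.openedAt === null) return true;
    // After the reset timeout, allow a trial request (half-open state).
    return this.now() - this.openedAt >= this.resetTimeoutMs;
  }

  // A successful call closes the breaker and clears the failure count.
  recordSuccess() {
    this.failures = 0;
    this.openedAt = null;
  }

  // Consecutive failures at or above the threshold open (or re-open) the breaker.
  recordFailure() {
    this.failures += 1;
    if (this.failures >= this.failureThreshold) {
      this.openedAt = this.now();
    }
  }
}
```

A caller would check `canRequest()` before contacting a dependency such as postgres or redis, then report the outcome with `recordSuccess()` or `recordFailure()`.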
Issue: Agent Mesh Discovery Failures
Symptoms
- Agents cannot find each other
- "No agents available" errors
- Partial mesh connectivity
- Routing failures
Cause
- DNS resolution issues
- Firewall blocking agent ports
- Mesh coordinator down
- Stale agent registration
- Network partition
Solution
    # Check mesh status
    buildkit agents mesh-health

    # Discover agents manually
    buildkit agents discover --all

    # Check routing status
    buildkit agents routing-status

    # Verify agent registration
    docker exec mesh-coordinator cat /var/lib/mesh/agents.json

    # Force re-registration
    docker exec agent-<name> /app/scripts/register-mesh.sh

    # Check network connectivity between agents (-c 3 stops ping after three packets)
    docker exec agent-a ping -c 3 agent-b
    docker exec agent-a nc -zv agent-b 3001
Prevention
- Implement mesh health monitoring
- Use heartbeat-based registration
- Configure TTL for agent registrations
- Set up redundant mesh coordinators
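Heartbeat-based registration with a TTL could look roughly like the following in-memory sketch. The `AgentRegistry` class and its 30-second default TTL are assumptions for illustration; the real mesh coordinator may store registrations differently.

```javascript
// In-memory registry where a registration expires unless the agent keeps sending heartbeats.
class AgentRegistry {
  constructor({ ttlMs = 30000, now = Date.now } = {}) {
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, useful for tests
    this.lastSeen = new Map(); // agent name -> timestamp of last heartbeat
  }

  // Called whenever an agent reports in; also serves as initial registration.
  heartbeat(name) {
    this.lastSeen.set(name, this.now());
  }

  // Drop registrations whose last heartbeat is older than the TTL.
  prune() {
    const cutoff = this.now() - this.ttlMs;
    for (const [name, seen] of this.lastSeen) {
      if (seen < cutoff) this.lastSeen.delete(name);
    }
  }

  // Names of agents that are still considered alive, sorted for stable output.
  liveAgents() {
    this.prune();
    return [...this.lastSeen.keys()].sort();
  }
}
```

Because stale entries disappear on their own, a crashed agent stops being routed to after at most one TTL, without any manual cleanup.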
Issue: Agent Task Execution Failures
Symptoms
- Tasks assigned but not executed
- Partial task completion
- Error responses from agents
- Task queue growing
Cause
- Invalid task payload
- Missing required capabilities
- Resource constraints
- External API failures
- Timeout exceeded
Solution
    # Check task queue
    docker exec redis redis-cli LRANGE agent:tasks:pending 0 -1

    # View failed tasks
    docker exec redis redis-cli LRANGE agent:tasks:failed 0 -1

    # Check agent capabilities
    curl http://localhost:3001/capabilities

    # Validate task payload
    buildkit task validate --file task.json

    # Retry failed task
    docker exec agent-<name> /app/scripts/retry-task.sh <task-id>

    # Check external API connectivity
    docker exec agent-<name> curl -v https://api.openai.com/v1/models
Prevention
- Validate task payloads before queuing
- Implement task timeout handling
- Add retry logic with exponential backoff
- Monitor task success/failure rates
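Retry logic with exponential backoff can be written without any library, as in the sketch below. The function names and the `sleep` option are hypothetical, and the defaults (5 retries, 1 s base delay, 60 s cap) are illustrative values rather than required settings.

```javascript
// Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at maxMs.
function backoffDelay(attempt, baseMs = 1000, maxMs = 60000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Run an async task, retrying on rejection up to `retries` additional times.
// `sleep` can be overridden (for example in tests) to avoid real waiting.
async function retryWithBackoff(task, { retries = 5, baseMs = 1000, maxMs = 60000, sleep } = {}) {
  const wait = sleep ?? ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await task(attempt);
    } catch (err) {
      // Out of retries: surface the last error to the caller.
      if (attempt >= retries) throw err;
      await wait(backoffDelay(attempt, baseMs, maxMs));
    }
  }
}
```

Adding random jitter to each delay is a common refinement when many agents retry against the same dependency at once; it is left out here to keep the sketch short.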
Issue: Agent-to-Agent Communication Failures
Symptoms
- Cross-agent requests time out
- "Connection refused" errors
- Message delivery failures
- Inconsistent agent state
Cause
- Network policy blocking traffic
- TLS certificate issues
- Message serialization errors
- Buffer overflow on receiver
- Protocol version mismatch
Solution
    # Test direct connectivity
    docker exec agent-a curl -v http://agent-b:3001/api/ping

    # Check TLS certificates
    docker exec agent-a openssl s_client -connect agent-b:3001

    # Verify message format
    docker exec agent-a tail -n 20 /var/log/agent/outgoing.log

    # Check protocol versions
    docker exec agent-a cat /app/package.json | jq '.dependencies["@mesh/protocol"]'
    docker exec agent-b cat /app/package.json | jq '.dependencies["@mesh/protocol"]'

    # Coordinate message between agents using buildkit
    buildkit agents coordinate --source agent-a --target agent-b --message '{"type": "ping"}'
Prevention
- Use consistent protocol versions
- Implement message validation
- Set up tracing for inter-agent calls
- Monitor latency between agents
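Message validation and protocol version checks can be combined in one guard that runs before a message is sent or processed. The envelope shape below (`type`, `protocolVersion`) is an assumption for illustration; the actual `@mesh/protocol` message format is not shown in this guide.

```javascript
// Validate a hypothetical message envelope:
//   { type: string, protocolVersion: "MAJOR.MINOR.PATCH", ... }
// Returns a list of problems; an empty list means the message passed.
function validateMessage(message, supportedMajor) {
  if (message === null || typeof message !== 'object') {
    return ['message must be an object'];
  }
  const errors = [];
  if (typeof message.type !== 'string' || message.type.length === 0) {
    errors.push('type must be a non-empty string');
  }
  const match = /^(\d+)\.\d+\.\d+$/.exec(String(message.protocolVersion ?? ''));
  if (!match) {
    errors.push('protocolVersion must look like MAJOR.MINOR.PATCH');
  } else if (Number(match[1]) !== supportedMajor) {
    // Differing major versions are treated as incompatible.
    errors.push(`unsupported protocol major version ${match[1]}`);
  }
  return errors;
}
```

Rejecting a message with an explicit error list makes a version mismatch show up in logs as such, instead of surfacing later as a serialization failure.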
Issue: Agent Memory Leaks
Symptoms
- Agent memory usage growing over time
- OOMKilled restarts
- Degrading performance
- Garbage collection pauses
Cause
- Unbounded caches
- Event listener leaks
- Circular references
- Large object retention
- Connection pool exhaustion
Solution
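    # Monitor memory usage
    docker stats agent-<name>

    # Check for OOM events
    docker inspect agent-<name> | jq '.[0].State.OOMKilled'

    # Generate heap dump (Node.js)
    # Note: by default Node.js uses SIGUSR1 to start the inspector. The signal writes a
    # heap snapshot only if the agent installs its own handler or Node.js was started
    # with --heapsnapshot-signal=<signal>. Adjust the path below to wherever the
    # snapshot is actually written.
    docker exec agent-<name> kill -USR1 1
    docker cp agent-<name>:/tmp/heapdump.heapsnapshot .

    # Analyze with Chrome DevTools or clinic.js
    npx clinic heapprofiler -- node app.js

    # Release memory (temporary fix)
    # Running node -e "global.gc()" via docker exec starts a separate process and cannot
    # trigger garbage collection in the running agent, so restart the agent instead.
    docker restart agent-<name>

    # Set memory limits
    # In docker-compose.yml:
    deploy:
      resources:
        limits:
          memory: 1G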
Prevention
- Use WeakMap/WeakSet for caches keyed by objects, so entries can be collected once the key is unreachable
- Implement bounded caches with LRU eviction
- Clean up event listeners properly
- Profile memory regularly in development
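A bounded cache with LRU eviction needs very little code in Node.js, because `Map` iterates in insertion order. The `LruCache` class below is a minimal sketch under that assumption, not a replacement for a maintained cache library.

```javascript
// Bounded cache: once maxEntries is exceeded, the least recently used key is evicted.
class LruCache {
  constructor(maxEntries) {
    if (!Number.isInteger(maxEntries) || maxEntries < 1) {
      throw new RangeError('maxEntries must be a positive integer');
    }
    this.maxEntries = maxEntries;
    this.entries = new Map(); // iteration order is insertion order: oldest first
  }

  get(key) {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key);
    // Re-insert so the key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key, value) {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      this.entries.delete(oldest);
    }
  }

  get size() {
    return this.entries.size;
  }
}
```

The size limit is what keeps memory flat: however many distinct keys the agent sees, the cache never holds more than `maxEntries` values.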
Issue: Agent Configuration Drift
Symptoms
- Different behavior across agent instances
- Unexpected feature flag states
- Configuration not matching manifest
- Environment-specific bugs
Cause
- Manual configuration changes
- Cached configuration not refreshed
- Partial deployment
- Environment variable mismatches
Solution
    # Export current configuration
    docker exec agent-<name> cat /app/config/config.yaml

    # Compare with expected
    diff <(docker exec agent-<name> cat /app/config/config.yaml) config/agent-config.yaml

    # Reload configuration
    docker exec agent-<name> kill -HUP 1

    # Verify environment variables
    docker exec agent-<name> env | sort

    # Redeploy from manifest
    buildkit agents deploy .agents/<name>/manifest.json --force

    # Sync configuration across agents
    buildkit agents sync-config --all
Prevention
- Use GitOps for all configuration
- Implement configuration validation
- Version all configuration changes
- Monitor configuration checksums
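One way to monitor configuration checksums is to hash a canonical form of the parsed configuration, so that two instances with the same settings produce the same digest regardless of key order. The helper names below are hypothetical, and the sketch assumes the configuration has already been parsed into plain JSON-compatible values.

```javascript
const { createHash } = require('node:crypto');

// Serialize with object keys sorted so key order does not change the checksum.
function canonicalize(value) {
  if (Array.isArray(value)) {
    return `[${value.map(canonicalize).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const body = Object.keys(value)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonicalize(value[key])}`)
      .join(',');
    return `{${body}}`;
  }
  return JSON.stringify(value);
}

// SHA-256 digest (hex) of the canonical form of a parsed configuration object.
function configChecksum(config) {
  return createHash('sha256').update(canonicalize(config)).digest('hex');
}
```

Each agent could expose or log this digest; any instance whose value differs from the one computed from the versioned configuration has drifted.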
Issue: Agent Rate Limiting
Symptoms
- 429 Too Many Requests errors
- Slowdown in agent operations
- Cascading failures
- External API quota exceeded
Cause
- Excessive API calls
- Missing rate limit handling
- Retry storms
- Burst traffic patterns
- Shared quota exhaustion
Solution
    # Check rate limit status
    docker exec agent-<name> cat /var/log/agent/rate-limits.log

    # Verify rate limit configuration
    docker exec agent-<name> cat /app/config/rate-limits.yaml

    # Implement backoff
    # In agent code:
    const retry = require('retry');
    const operation = retry.operation({
      retries: 5,
      factor: 2,
      minTimeout: 1000,
      maxTimeout: 60000,
    });

    # Monitor quota usage
    curl -H "Authorization: Bearer $API_KEY" \
      https://api.openai.com/v1/usage

    # Distribute load across API keys
    buildkit agents balance-quotas
Prevention
- Implement proper rate limiting
- Use token bucket algorithms
- Monitor API quota usage
- Configure alerts at 80% quota
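The token bucket algorithm mentioned above allows short bursts while enforcing an average request rate. The following is a minimal single-process sketch; the class and option names are assumptions, and agents that share one quota would need a shared store rather than in-memory state.

```javascript
// Token bucket: holds up to `capacity` tokens and refills at `refillPerSecond`.
class TokenBucket {
  constructor({ capacity, refillPerSecond, now = Date.now }) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.now = now; // injectable clock, useful for tests
    this.tokens = capacity; // start full so an initial burst is allowed
    this.lastRefill = now();
  }

  // Add the tokens earned since the last refill, never exceeding capacity.
  refill() {
    const current = this.now();
    const elapsedSeconds = (current - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefill = current;
  }

  // Take `cost` tokens if available; returns false when the caller should wait.
  tryRemove(cost = 1) {
    this.refill();
    if (this.tokens < cost) return false;
    this.tokens -= cost;
    return true;
  }
}
```

Calling `tryRemove()` before each outbound API request, and delaying the request when it returns false, keeps the agent under the limit without waiting for a 429 response.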
Issue: Agent Authentication Failures
Symptoms
- "Unauthorized" errors
- Token refresh failures
- Session expiration
- Inter-agent auth failures
Cause
- Expired tokens
- Invalid credentials
- Certificate mismatch
- Clock drift
- Revoked access
Solution
    # Verify token validity
    docker exec agent-<name> /app/scripts/check-token.sh

    # Refresh authentication
    docker exec agent-<name> /app/scripts/refresh-auth.sh

    # Check clock synchronization (compare container time with host time)
    docker exec agent-<name> date
    date
    # If different (needs the SYS_TIME capability; otherwise sync the clock on the host):
    docker exec agent-<name> ntpdate -s time.google.com

    # Verify certificates
    docker exec agent-<name> openssl x509 -in /etc/ssl/agent.crt -noout -dates

    # Rotate credentials
    buildkit agents rotate-credentials --agent <name>

    # Check token file permissions
    docker exec agent-<name> ls -la /secrets/
Prevention
- Implement automatic token refresh
- Monitor token expiration
- Use short-lived tokens with refresh
- Set up clock synchronization
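Automatic token refresh usually comes down to one decision: how close to expiry is the token, after allowing for clock drift? The sketch below shows that decision only. The function name, the 5-minute refresh window, and the 30-second skew allowance are assumptions, and how the expiry timestamp is obtained depends on the token format in use.

```javascript
// Classify a token by its remaining lifetime.
//   expiresAtMs     expiry timestamp in milliseconds since the epoch
//   nowMs           current time in milliseconds since the epoch
//   refreshWindowMs how long before expiry a refresh should start
//   maxClockSkewMs  tolerated difference between this host and the token issuer
function tokenStatus(expiresAtMs, nowMs, { refreshWindowMs = 300000, maxClockSkewMs = 30000 } = {}) {
  // Subtract the skew allowance so a slightly slow local clock cannot hide an expiry.
  const remainingMs = expiresAtMs - nowMs - maxClockSkewMs;
  if (remainingMs <= 0) return 'expired';
  if (remainingMs <= refreshWindowMs) return 'refresh';
  return 'valid';
}
```

A periodic check that triggers a refresh on `'refresh'` and raises an alert on `'expired'` covers both the refresh and the monitoring recommendations.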
Issue: Agent Scaling Problems
Symptoms
- Cannot spawn new agent instances
- Load not distributed evenly
- Resource contention
- Inconsistent performance
Cause
- Resource quota exceeded
- Network address exhaustion
- Storage limits
- Registry rate limits
- Orchestrator constraints
Solution
    # Check current scale
    kubectl get deployments -l app=agent

    # View resource usage
    kubectl top pods -l app=agent

    # Scale manually
    kubectl scale deployment agent-worker --replicas=5

    # Check HPA status
    kubectl get hpa agent-worker

    # View scaling events
    kubectl describe hpa agent-worker

    # Adjust resource requests
    # In deployment.yaml:
    resources:
      requests:
        cpu: 100m
        memory: 256Mi
      limits:
        cpu: 500m
        memory: 512Mi
Prevention
- Configure appropriate resource requests
- Set up Horizontal Pod Autoscaler
- Monitor resource utilization
- Plan capacity for peak load
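For capacity planning it helps to know how the Horizontal Pod Autoscaler picks a replica count. Its documented core rule is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), skipped when the ratio is within a tolerance of 1 (0.1 by default). The function below is a simplified sketch of that rule; it ignores stabilization windows, pod readiness, and multiple metrics, and its name and option names are hypothetical.

```javascript
// Simplified version of the HPA scaling rule, clamped to min/max replica bounds.
function desiredReplicas(currentReplicas, currentMetric, targetMetric,
  { min = 1, max = Infinity, tolerance = 0.1 } = {}) {
  const ratio = currentMetric / targetMetric;
  // Within tolerance of the target: keep the current replica count.
  if (Math.abs(ratio - 1) <= tolerance) return currentReplicas;
  const desired = Math.ceil(currentReplicas * ratio);
  return Math.min(max, Math.max(min, desired));
}
```

For example, four replicas averaging 200m CPU against a 100m target give a ratio of 2, so the autoscaler would aim for eight replicas, subject to the configured maximum.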
Issue: Agent State Synchronization
Symptoms
- Inconsistent state across replicas
- Lost updates
- Race conditions
- Stale data
Cause
- No distributed locking
- Cache invalidation issues
- Event ordering problems
- Network partitions
- Eventual consistency delays
Solution
    # Check Redis cluster status
    docker exec redis redis-cli CLUSTER INFO

    # Verify state consistency
    docker exec agent-a curl localhost:3001/state | jq .
    docker exec agent-b curl localhost:3001/state | jq .

    # Force state sync
    docker exec agent-<name> /app/scripts/sync-state.sh

    # Implement distributed locking
    # In agent code (redis.lock stands for whichever locking helper the agent uses):
    const lock = await redis.lock('resource:123', { ttl: 5000 });
    try {
      // Critical section
    } finally {
      await lock.unlock();
    }

    # Clear stale state
    # Warning: FLUSHDB deletes every key in the current Redis database,
    # including queued tasks. Use it only when losing that data is acceptable.
    docker exec redis redis-cli FLUSHDB
    docker restart agent-<name>
Prevention
- Use Redis for distributed state
- Implement proper locking mechanisms
- Design for eventual consistency
- Add state versioning
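State versioning can prevent lost updates without holding a lock: each write names the version it was based on, and the store rejects writes based on stale data. The in-memory `VersionedStore` below is a hypothetical sketch of that idea; with Redis as the backing store, the check and the write would have to happen atomically on the server.

```javascript
// Optimistic concurrency: a write succeeds only if it is based on the latest version.
class VersionedStore {
  constructor() {
    this.records = new Map(); // key -> { version, value }
  }

  // Unknown keys read as version 0 with no value.
  read(key) {
    return this.records.get(key) ?? { version: 0, value: undefined };
  }

  // Write only if the caller saw the latest version; otherwise report a conflict
  // together with the current version so the caller can re-read and retry.
  write(key, value, expectedVersion) {
    const current = this.read(key);
    if (current.version !== expectedVersion) {
      return { ok: false, version: current.version };
    }
    const version = current.version + 1;
    this.records.set(key, { version, value });
    return { ok: true, version };
  }
}
```

When two replicas read version 1 and both try to write, only the first succeeds; the second receives a conflict and must re-read before retrying, so neither update is silently overwritten.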