Agent Issues

Troubleshooting guide for agent communication, orchestration, and execution problems.

Issue: Agent Not Responding

Symptoms

Agent health check fails
No response to mesh discovery
Timeout on agent communication
Agent shows as "offline" in dashboard

Cause

Agent process crashed
Network connectivity issues
Resource exhaustion (CPU/memory)
Configuration errors
Dependency service unavailable

Solution

# Check agent container status
docker ps -a | grep agent

# View agent logs
docker logs agent-<name> --tail 200 -f

# Verify health endpoint
curl -v http://localhost:3001/health

# Check agent mesh connectivity
buildkit agents discover

# Restart agent
docker restart agent-<name>

# Verify agent manifest
ossa validate .agents/<name>/manifest.json

# Check dependencies
docker exec agent-<name> ping -c 3 postgres
docker exec agent-<name> ping -c 3 redis

Prevention

Implement proper health checks
Set up monitoring and alerting
Configure automatic restart policies
Use circuit breakers for dependencies
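
As a rough illustration of the circuit-breaker item above, the following sketch stops calling a dependency after repeated failures and allows a trial call once a cool-down has passed. The class name, option names, and default thresholds are assumptions made for this example; they are not part of buildkit or any agent SDK.

```javascript
// Minimal circuit breaker sketch. Class, option, and method names are illustrative.
class CircuitBreaker {
  constructor({ failureThreshold = 3, resetTimeoutMs = 10000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now; // injectable clock, useful in tests
    this.failures = 0;
    this.openedAt = null; // null while the circuit is closed
  }

  get state() {
    if (this.openedAt === null) return 'closed';
    const elapsed = this.now() - this.openedAt;
    return elapsed >= this.resetTimeoutMs ? 'half-open' : 'open';
  }

  allowRequest() {
    return this.state !== 'open';
  }

  recordSuccess() {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure() {
    this.failures += 1;
    if (this.failures >= this.failureThreshold) {
      this.openedAt = this.now(); // open, or re-open after a failed trial call
    }
  }

  // Wraps an async dependency call and updates the breaker from its outcome.
  async call(fn) {
    if (!this.allowRequest()) {
      throw new Error('circuit open: dependency call skipped');
    }
    try {
      const result = await fn();
      this.recordSuccess();
      return result;
    } catch (err) {
      this.recordFailure();
      throw err;
    }
  }
}
```

An agent could wrap each dependency call as `await breaker.call(() => checkPostgres())`, where `checkPostgres` is a hypothetical function supplied by the agent.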

Issue: Agent Mesh Discovery Failures

Symptoms

Agents cannot find each other
"No agents available" errors
Partial mesh connectivity
Routing failures

Cause

DNS resolution issues
Firewall blocking agent ports
Mesh coordinator down
Stale agent registration
Network partition

Solution

# Check mesh status
buildkit agents mesh-health

# Discover agents manually
buildkit agents discover --all

# Check routing status
buildkit agents routing-status

# Verify agent registration
docker exec mesh-coordinator cat /var/lib/mesh/agents.json

# Force re-registration
docker exec agent-<name> /app/scripts/register-mesh.sh

# Check network connectivity between agents
docker exec agent-a ping -c 3 agent-b
docker exec agent-a nc -zv agent-b 3001

Prevention

Implement mesh health monitoring
Use heartbeat-based registration
Configure TTL for agent registrations
Set up redundant mesh coordinators
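
The heartbeat and TTL items above can be sketched as an in-memory registry in which an agent stays registered only while it keeps sending heartbeats. This is a simplified model, not the mesh coordinator's actual implementation; `AgentRegistry`, `ttlMs`, and the agent IDs are names invented for the example.

```javascript
// Heartbeat-based registration with a TTL. All names are illustrative.
class AgentRegistry {
  constructor({ ttlMs = 30000, now = Date.now } = {}) {
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, useful in tests
    this.lastSeen = new Map(); // agentId -> timestamp of the last heartbeat
  }

  // Called whenever an agent reports in; also acts as (re-)registration.
  heartbeat(agentId) {
    this.lastSeen.set(agentId, this.now());
  }

  // Drops every agent whose last heartbeat is older than the TTL.
  prune() {
    const cutoff = this.now() - this.ttlMs;
    for (const [agentId, seenAt] of this.lastSeen) {
      if (seenAt < cutoff) this.lastSeen.delete(agentId);
    }
  }

  // Returns the IDs of agents that are still considered alive, sorted.
  activeAgents() {
    this.prune();
    return [...this.lastSeen.keys()].sort();
  }
}
```

Because stale entries expire on their own, a crashed agent disappears from discovery after at most one TTL without any manual cleanup.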

Issue: Agent Task Execution Failures

Symptoms

Tasks assigned but not executed
Partial task completion
Error responses from agents
Task queue growing

Cause

Invalid task payload
Missing required capabilities
Resource constraints
External API failures
Timeout exceeded

Solution

# Check task queue
docker exec redis redis-cli LRANGE agent:tasks:pending 0 -1

# View failed tasks
docker exec redis redis-cli LRANGE agent:tasks:failed 0 -1

# Check agent capabilities
curl http://localhost:3001/capabilities

# Validate task payload
buildkit task validate --file task.json

# Retry failed task
docker exec agent-<name> /app/scripts/retry-task.sh <task-id>

# Check external API connectivity
docker exec agent-<name> curl -v https://api.openai.com/v1/models

Prevention

Validate task payloads before queuing
Implement task timeout handling
Add retry logic with exponential backoff
Monitor task success/failure rates
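
To make the exponential backoff item above concrete, here is a dependency-free sketch. `backoffDelayMs` and `retryWithBackoff` are hypothetical helper names, and the defaults (1 s base, factor 2, 60 s cap, 5 retries) are assumptions to tune per workload.

```javascript
// Upper bound of the delay before retry number `attempt` (0-based), capped at maxMs.
function backoffDelayMs(attempt, { baseMs = 1000, factor = 2, maxMs = 60000 } = {}) {
  return Math.min(maxMs, baseMs * Math.pow(factor, attempt));
}

// Runs an async task, retrying on failure with exponential backoff and full jitter.
// `sleep` and `random` are injectable so the behavior can be tested deterministically.
async function retryWithBackoff(
  fn,
  {
    retries = 5,
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
    random = Math.random,
    ...delayOptions
  } = {}
) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt += 1) {
    try {
      return await fn(attempt);
    } catch (err) {
      lastError = err;
      if (attempt === retries) break; // out of retries
      // Full jitter: wait a random time between 0 and the capped delay.
      await sleep(random() * backoffDelayMs(attempt, delayOptions));
    }
  }
  throw lastError;
}
```

The jitter spreads retries from many agents over time, which helps avoid synchronized retry bursts against a recovering dependency.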

Issue: Agent-to-Agent Communication Failures

Symptoms

Cross-agent requests time out
"Connection refused" errors
Message delivery failures
Inconsistent agent state

Cause

Network policy blocking traffic
TLS certificate issues
Message serialization errors
Buffer overflow on receiver
Protocol version mismatch

Solution

# Test direct connectivity
docker exec agent-a curl -v http://agent-b:3001/api/ping

# Check TLS certificates
docker exec agent-a openssl s_client -connect agent-b:3001

# Verify message format
docker exec agent-a tail -n 20 /var/log/agent/outgoing.log

# Check protocol versions
docker exec agent-a cat /app/package.json | jq '.dependencies["@mesh/protocol"]'
docker exec agent-b cat /app/package.json | jq '.dependencies["@mesh/protocol"]'

# Coordinate message between agents using buildkit
buildkit agents coordinate --source agent-a --target agent-b --message '{"type": "ping"}'

Prevention

Use consistent protocol versions
Implement message validation
Set up tracing for inter-agent calls
Monitor latency between agents
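
The message validation and protocol version items above might look like the following check on the receiving side. The envelope fields `type` and `protocolVersion` are assumptions for this sketch (only `type` appears in the ping example earlier in this section), and `validateMessage` is a hypothetical helper.

```javascript
// Validates a raw inter-agent message before it is processed.
// The envelope shape ({ type, protocolVersion }) is assumed for illustration.
function validateMessage(raw, { supportedMajor = 1 } = {}) {
  let message;
  try {
    message = JSON.parse(raw);
  } catch (err) {
    return { ok: false, errors: ['not valid JSON'] };
  }
  if (message === null || typeof message !== 'object' || Array.isArray(message)) {
    return { ok: false, errors: ['message must be a JSON object'] };
  }

  const errors = [];
  if (typeof message.type !== 'string' || message.type.length === 0) {
    errors.push('missing string field "type"');
  }
  if (typeof message.protocolVersion !== 'string') {
    errors.push('missing string field "protocolVersion"');
  } else {
    // Only the major version has to match for two agents to be compatible.
    const major = Number.parseInt(message.protocolVersion.split('.')[0], 10);
    if (major !== supportedMajor) {
      errors.push(`unsupported protocol version ${message.protocolVersion}`);
    }
  }
  return { ok: errors.length === 0, errors, message };
}
```

Rejecting a malformed or incompatible message at the boundary gives a clear error on the receiver instead of a serialization failure deeper in the agent.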

Issue: Agent Memory Leaks

Symptoms

Agent memory usage growing over time
OOMKilled restarts
Degrading performance
Garbage collection pauses

Cause

Unbounded caches
Event listener leaks
Circular references
Large object retention
Connection pool exhaustion

Solution

# Monitor memory usage
docker stats agent-<name>

# Check for OOM events
docker inspect agent-<name> | jq '.[0].State.OOMKilled'

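# Generate heap snapshot (Node.js)
# SIGUSR1 only activates the Node.js inspector. To write a snapshot on a signal,
# start the agent with `node --heapsnapshot-signal=SIGUSR2`, then:
docker exec agent-<name> kill -USR2 1
# Node.js writes Heap.*.heapsnapshot to the process working directory;
# adjust the source path below to match where your agent writes it
docker cp agent-<name>:/tmp/heapdump.heapsnapshot .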

# Analyze with Chrome DevTools or clinic.js
npx clinic heapprofiler -- node app.js

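# Reclaim memory (temporary fix)
# `node -e "global.gc()"` would start a separate Node.js process and cannot
# collect the running agent's heap, so restart the agent instead
docker restart agent-<name>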

# Set memory limits
# In docker-compose.yml:
deploy:
  resources:
    limits:
      memory: 1G

Prevention

Use WeakMap/WeakSet for caches
Implement bounded caches with LRU eviction
Clean up event listeners properly
Profile memory regularly in development
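
A bounded cache with LRU eviction, as recommended above, can be built on the insertion order of a JavaScript `Map`. The sketch below is a minimal version; `LruCache` and `maxEntries` are names chosen for this example.

```javascript
// Bounded cache with least-recently-used eviction, built on Map insertion order.
class LruCache {
  constructor(maxEntries = 1000) {
    if (!Number.isInteger(maxEntries) || maxEntries < 1) {
      throw new RangeError('maxEntries must be a positive integer');
    }
    this.maxEntries = maxEntries;
    this.map = new Map();
  }

  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    // Re-insert so the key becomes the most recently used entry.
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }

  set(key, value) {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.maxEntries) {
      // The first key in iteration order is the least recently used one.
      const oldestKey = this.map.keys().next().value;
      this.map.delete(oldestKey);
    }
  }

  get size() {
    return this.map.size;
  }
}
```

Unlike a plain object or an unbounded `Map`, this cache cannot grow past `maxEntries`, so its memory footprint stays predictable.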

Issue: Agent Configuration Drift

Symptoms

Different behavior across agent instances
Unexpected feature flag states
Configuration not matching manifest
Environment-specific bugs

Cause

Manual configuration changes
Cached configuration not refreshed
Partial deployment
Environment variable mismatches

Solution

# Export current configuration
docker exec agent-<name> cat /app/config/config.yaml

# Compare with expected
diff <(docker exec agent-<name> cat /app/config/config.yaml) config/agent-config.yaml

# Reload configuration
docker exec agent-<name> kill -HUP 1

# Verify environment variables
docker exec agent-<name> env | sort

# Redeploy from manifest
buildkit agents deploy .agents/<name>/manifest.json --force

# Sync configuration across agents
buildkit agents sync-config --all

Prevention

Use GitOps for all configuration
Implement configuration validation
Version all configuration changes
Monitor configuration checksums
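
One way to monitor configuration checksums is to hash a canonical form of each agent's parsed configuration and compare it with the hash of the expected configuration. The sketch below assumes the configuration is already parsed into a plain object; `configChecksum` and `findDrift` are hypothetical helpers.

```javascript
const { createHash } = require('node:crypto');

// Recursively sorts object keys so that key order does not affect the checksum.
function canonicalize(value) {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === 'object') {
    return Object.keys(value)
      .sort()
      .reduce((sorted, key) => {
        sorted[key] = canonicalize(value[key]);
        return sorted;
      }, {});
  }
  return value;
}

// SHA-256 checksum (hex) of a parsed configuration object.
function configChecksum(config) {
  return createHash('sha256').update(JSON.stringify(canonicalize(config))).digest('hex');
}

// Returns the sorted names of agents whose reported checksum differs from the expected one.
function findDrift(expectedConfig, reportedChecksums) {
  const expected = configChecksum(expectedConfig);
  return Object.entries(reportedChecksums)
    .filter(([, checksum]) => checksum !== expected)
    .map(([agentName]) => agentName)
    .sort();
}
```

Each agent could expose its checksum on a status endpoint, and a periodic job could alert whenever `findDrift` returns a non-empty list.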

Issue: Agent Rate Limiting

Symptoms

429 Too Many Requests errors
Slowdown in agent operations
Cascading failures
External API quota exceeded

Cause

Excessive API calls
Missing rate limit handling
Retry storms
Burst traffic patterns
Shared quota exhaustion

Solution

# Check rate limit status
docker exec agent-<name> cat /var/log/agent/rate-limits.log

# Verify rate limit configuration
docker exec agent-<name> cat /app/config/rate-limits.yaml

# Implement backoff
# In agent code:
const retry = require('retry');
const operation = retry.operation({
  retries: 5,
  factor: 2,
  minTimeout: 1000,
  maxTimeout: 60000,
});

# Monitor quota usage
curl -H "Authorization: Bearer $API_KEY" \
  https://api.openai.com/v1/usage

# Distribute load across API keys
buildkit agents balance-quotas

Prevention

Implement proper rate limiting
Use token bucket algorithms
Monitor API quota usage
Configure alerts at 80% quota
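
The token bucket algorithm mentioned above allows short bursts up to a fixed capacity while enforcing an average rate. The following is a minimal single-process sketch; `TokenBucket`, `capacity`, and `refillPerSecond` are names chosen for illustration, and a quota shared between agents would need shared storage instead of in-memory state.

```javascript
// Token bucket rate limiter (single process, in memory). Names are illustrative.
class TokenBucket {
  constructor({ capacity, refillPerSecond, now = Date.now }) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.now = now; // injectable clock, useful in tests
    this.tokens = capacity; // start with a full bucket
    this.lastRefill = now();
  }

  // Adds the tokens earned since the last refill, never exceeding capacity.
  refill() {
    const current = this.now();
    const elapsedSeconds = (current - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefill = current;
  }

  // Takes `count` tokens if available; returns false when the caller must wait.
  tryRemove(count = 1) {
    this.refill();
    if (this.tokens < count) return false;
    this.tokens -= count;
    return true;
  }

  // Milliseconds until `count` tokens will be available (0 if they already are).
  msUntilAvailable(count = 1) {
    this.refill();
    if (this.tokens >= count) return 0;
    return Math.ceil(((count - this.tokens) / this.refillPerSecond) * 1000);
  }
}
```

Calling `tryRemove()` before each outbound API request, and waiting `msUntilAvailable()` when it returns `false`, keeps an agent under its quota without relying on 429 responses.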

Issue: Agent Authentication Failures

Symptoms

"Unauthorized" errors
Token refresh failures
Session expiration
Inter-agent auth failures

Cause

Expired tokens
Invalid credentials
Certificate mismatch
Clock drift
Revoked access

Solution

# Verify token validity
docker exec agent-<name> /app/scripts/check-token.sh

# Refresh authentication
docker exec agent-<name> /app/scripts/refresh-auth.sh

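# Check clock synchronization (compare in UTC)
docker exec agent-<name> date -u
date -u
# If the clock is wrong, resync it on the Docker host (or the Docker Desktop VM):
# containers share that clock and normally lack the privilege to set it
sudo ntpdate -s time.google.com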

# Verify certificates
docker exec agent-<name> openssl x509 -in /etc/ssl/agent.crt -noout -dates

# Rotate credentials
buildkit agents rotate-credentials --agent <name>

# Check token file permissions
docker exec agent-<name> ls -la /secrets/

Prevention

Implement automatic token refresh
Monitor token expiration
Use short-lived tokens with refresh
Set up clock synchronization
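
Automatic token refresh usually means renewing a token shortly before it expires, with some tolerance for clock drift. The sketch below assumes a token object that carries an `expiresAtMs` timestamp in epoch milliseconds; that field, `tokenStatus`, and `ensureFreshToken` are hypothetical names, and the 60-second refresh window is an arbitrary default.

```javascript
// Classifies a token as 'valid', 'refresh' (still usable, renew now), or 'expired'.
// `clockSkewMs` treats a token as expired slightly early to tolerate clock drift.
function tokenStatus(
  token,
  { nowMs = Date.now(), refreshWindowMs = 60000, clockSkewMs = 5000 } = {}
) {
  const remainingMs = token.expiresAtMs - nowMs;
  if (remainingMs <= clockSkewMs) return 'expired';
  if (remainingMs <= refreshWindowMs) return 'refresh';
  return 'valid';
}

// Returns the current token if it is still valid, otherwise fetches a new one.
// `fetchToken` is a caller-supplied async function that resolves to a token object.
async function ensureFreshToken(token, fetchToken, options) {
  return tokenStatus(token, options) === 'valid' ? token : fetchToken();
}
```

Refreshing inside the window rather than after expiry means in-flight requests never carry a token that the receiving service already rejects.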

Issue: Agent Scaling Problems

Symptoms

Cannot spawn new agent instances
Load not distributed evenly
Resource contention
Inconsistent performance

Cause

Resource quota exceeded
Network address exhaustion
Storage limits
Registry rate limits
Orchestrator constraints

Solution

# Check current scale
kubectl get deployments -l app=agent

# View resource usage
kubectl top pods -l app=agent

# Scale manually
kubectl scale deployment agent-worker --replicas=5

# Check HPA status
kubectl get hpa agent-worker

# View scaling events
kubectl describe hpa agent-worker

# Adjust resource requests
# In deployment.yaml:
resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

Prevention

Configure appropriate resource requests
Set up Horizontal Pod Autoscaler
Monitor resource utilization
Plan capacity for peak load

Issue: Agent State Synchronization

Symptoms

Inconsistent state across replicas
Lost updates
Race conditions
Stale data

Cause

No distributed locking
Cache invalidation issues
Event ordering problems
Network partitions
Eventual consistency delays

Solution

# Check Redis cluster status
docker exec redis redis-cli CLUSTER INFO

# Verify state consistency
docker exec agent-a curl localhost:3001/state | jq .
docker exec agent-b curl localhost:3001/state | jq .

# Force state sync
docker exec agent-<name> /app/scripts/sync-state.sh

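# Implement distributed locking
# In agent code (assumes a `redlock` v4 instance named `redlock`;
# the node-redis and ioredis clients do not provide a lock() method themselves):
const lock = await redlock.lock('resource:123', 5000); // TTL in milliseconds
try {
  // Critical section
} finally {
  await lock.unlock();
}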

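# Clear stale state (destructive: FLUSHDB deletes every key in the selected
# Redis database, including any task queues stored there; use as a last resort)
docker exec redis redis-cli FLUSHDB
docker restart agent-<name>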

Prevention

Use Redis for distributed state
Implement proper locking mechanisms
Design for eventual consistency
Add state versioning
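
State versioning can prevent lost updates by rejecting any write that was based on an outdated read. The in-memory sketch below shows the idea (optimistic concurrency with a version counter); `VersionedStore` is a hypothetical class, and a shared deployment would need the same compare-and-set performed atomically in the shared store.

```javascript
// Optimistic concurrency: every write must name the version it was based on.
class VersionedStore {
  constructor() {
    this.entries = new Map(); // key -> { value, version }
  }

  // Returns the current value and version; version 0 means "not written yet".
  read(key) {
    const entry = this.entries.get(key);
    return entry
      ? { value: entry.value, version: entry.version }
      : { value: undefined, version: 0 };
  }

  // Writes only if `expectedVersion` matches the stored version.
  write(key, value, expectedVersion) {
    const currentVersion = this.read(key).version;
    if (currentVersion !== expectedVersion) {
      return { ok: false, version: currentVersion }; // caller must re-read and retry
    }
    const nextVersion = currentVersion + 1;
    this.entries.set(key, { value, version: nextVersion });
    return { ok: true, version: nextVersion };
  }
}
```

When two replicas read the same version and both try to write, only the first write succeeds; the second receives `ok: false` and can re-read before deciding what to do.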

