Common Platform Issues

General troubleshooting guide for the BlueFly.io Agent Platform.


Issue: Service Not Starting

Symptoms

  • Service container exits immediately after starting
  • Health check endpoints return 503
  • Logs show "connection refused" or "port already in use"
  • OrbStack shows container in restart loop

Cause

  1. Port conflict with another service
  2. Missing environment variables
  3. Dependency service not ready
  4. Insufficient memory allocation
  5. Configuration file syntax errors

Solution

    # Check port conflicts
    lsof -i :3000  # Replace with service port

    # Verify environment variables
    docker exec <container> env | grep -E "^(DATABASE|REDIS|API)"

    # Check service logs
    docker logs <container> --tail 100

    # Verify dependencies are running
    docker ps | grep -E "(postgres|redis|qdrant)"

    # Restart with fresh state
    docker compose down && docker compose up -d

Prevention

  • Use port mapping validation in CI/CD
  • Document all required environment variables
  • Implement proper health checks with retry logic
  • Use Docker Compose depends_on with condition: service_healthy
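
Missing environment variables can be caught before the service process starts. The following is a minimal sketch, assuming the required names are known in advance; `require_env` is a hypothetical helper, and the variable names in the usage line are placeholders chosen to match the DATABASE/REDIS/API prefixes checked in the Solution above.

```shell
# require_env VAR [VAR...]
# Hypothetical helper: reports every listed variable that is unset or empty
# and returns non-zero if at least one is missing.
require_env() {
    _missing=0
    for _name in "$@"; do
        # Indirect lookup of the variable named by $_name (POSIX sh has no ${!name}).
        eval "_value=\${$_name:-}"
        if [ -z "$_value" ]; then
            echo "missing required environment variable: $_name" >&2
            _missing=1
        fi
    done
    return "$_missing"
}
```

Called at the top of a container entrypoint script, for example `require_env DATABASE_URL REDIS_URL API_KEY || exit 1`, it lists every missing name in one pass instead of failing on the first.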

Issue: Memory Exhaustion

Symptoms

  • Services killed unexpectedly (OOMKilled)
  • Slow response times across platform
  • Mac system becomes unresponsive
  • OrbStack shows high memory usage

Cause

  1. Memory leaks in long-running services
  2. Too many services running simultaneously
  3. Large dataset processing without streaming
  4. Unbounded caching

Solution

    # Check memory usage by container
    docker stats --no-stream

    # Identify OOMKilled containers
    docker inspect <container> | jq '.[0].State.OOMKilled'

    # Increase container memory limits
    # In docker-compose.yml:
    #   deploy:
    #     resources:
    #       limits:
    #         memory: 2G
    #       reservations:
    #         memory: 512M

    # Clear Docker cache
    # WARNING: removes all unused images and unused volumes; back up any data you need first
    docker system prune -a --volumes

    # Restart OrbStack to reclaim memory
    orb restart

Prevention

  • Set memory limits for all containers
  • Implement streaming for large data operations
  • Use bounded caches with TTL
  • Monitor memory usage with Prometheus/Grafana
  • Schedule regular container restarts for leaky services
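
Until Prometheus/Grafana monitoring is in place, a quick check can flag containers that are close to their memory limit. This is a sketch under stated assumptions: `high_memory_containers` is a hypothetical helper that reads "NAME PERCENT%" lines on standard input, which is the shape produced by `docker stats --no-stream --format '{{.Name}} {{.MemPerc}}'`.

```shell
# high_memory_containers THRESHOLD_PERCENT
# Hypothetical helper: reads "NAME PERCENT%" lines on stdin and prints the
# names whose memory percentage is at or above the threshold.
high_memory_containers() {
    awk -v limit="$1" '{
        pct = $2
        sub(/%$/, "", pct)            # strip the trailing percent sign
        if (pct + 0 >= limit + 0) print $1
    }'
}
```

A possible invocation is `docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' | high_memory_containers 80`; the 80 here is an arbitrary example threshold.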

Issue: Network Connectivity Failures

Symptoms

  • Services cannot reach each other
  • DNS resolution fails within containers
  • External API calls time out
  • Tailscale connectivity issues

Cause

  1. Docker network not created
  2. DNS configuration issues
  3. Firewall blocking traffic
  4. Tailscale not connected
  5. VPN conflicts

Solution

    # Verify Docker networks
    docker network ls
    docker network inspect blueflyio_default

    # Test DNS resolution
    docker exec <container> nslookup postgres
    docker exec <container> nslookup api.openai.com

    # Check Tailscale status
    tailscale status
    tailscale ping 100.108.129.7   # Mac M4
    tailscale ping 100.108.180.36  # Mac M3

    # Restart Docker networking
    docker network prune
    docker compose up -d

    # Reset Tailscale
    tailscale down && tailscale up

Prevention

  • Use service names instead of IP addresses
  • Implement connection retry logic with exponential backoff
  • Monitor network health with ping checks
  • Document network dependencies in service manifests
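
Connection retry with exponential backoff can be sketched in plain shell. The helper below is an illustration, not part of the platform: `retry_backoff` is a hypothetical name, and the attempt count and base delay are whatever the caller passes in.

```shell
# retry_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds, doubling the delay
# after every failure, and gives up after MAX_ATTEMPTS tries.
retry_backoff() {
    _max_attempts=$1
    _delay=$2
    shift 2
    _attempt=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$_attempt" -ge "$_max_attempts" ]; then
            echo "giving up after $_attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $_attempt failed; retrying in ${_delay}s" >&2
        sleep "$_delay"
        _delay=$((_delay * 2))
        _attempt=$((_attempt + 1))
    done
}
```

For example, `retry_backoff 5 1 docker exec <container> nslookup postgres` would wait 1, 2, 4, and 8 seconds between its five attempts.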

Issue: Disk Space Exhaustion

Symptoms

  • Build failures with "no space left on device"
  • Database writes failing
  • Log files not rotating
  • Container creation fails

Cause

  1. Docker images/volumes accumulating
  2. Log files not rotated
  3. Build artifacts not cleaned
  4. Database WAL files growing

Solution

    # Check disk usage
    df -h

    # Docker cleanup
    # WARNING: removes all unused images and unused volumes; back up any data you need first
    docker system prune -a --volumes
    docker volume prune

    # Clean build artifacts
    rm -rf node_modules/.cache
    rm -rf .next/cache
    rm -rf vendor/cache

    # PostgreSQL WAL cleanup
    docker exec postgres psql -c "SELECT pg_current_wal_lsn();"
    # Verify replication is caught up, then:
    docker exec postgres psql -c "CHECKPOINT;"

    # Clear logs older than 7 days
    find /var/log -name "*.log" -mtime +7 -delete

Prevention

  • Implement log rotation (logrotate or Docker logging driver)
  • Schedule weekly Docker cleanup jobs
  • Set WAL retention policies in PostgreSQL
  • Monitor disk usage with alerts at 80% threshold
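
The 80% alert threshold can be checked from a cron job or CI step before a full monitoring stack exists. This is a minimal sketch: `disks_over_threshold` is a hypothetical helper that parses the POSIX-format output of `df -P`, and it assumes mount points contain no spaces.

```shell
# disks_over_threshold THRESHOLD_PERCENT
# Hypothetical helper: reads `df -P` output on stdin and prints
# "MOUNTPOINT CAPACITY" for every filesystem at or above the threshold.
disks_over_threshold() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%$/, "", use)            # "85%" -> "85"
        if (use + 0 >= limit + 0) print $6 " " $5
    }'
}
```

Running `df -P | disks_over_threshold 80` prints nothing when every filesystem is below the threshold, so any output can be treated as an alert.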

Issue: SSL/TLS Certificate Errors

Symptoms

  • "Certificate has expired" errors
  • "Unable to verify certificate" warnings
  • HTTPS connections failing
  • Browser security warnings

Cause

  1. Let's Encrypt certificates expired
  2. Self-signed certificates not trusted
  3. Certificate chain incomplete
  4. System clock drift

Solution

    # Check certificate expiry
    echo | openssl s_client -connect localhost:443 2>/dev/null | openssl x509 -noout -dates

    # Renew Let's Encrypt certificates
    certbot renew --dry-run
    certbot renew

    # Verify certificate chain
    openssl s_client -connect localhost:443 -showcerts

    # Fix system clock
    sudo sntp -sS time.apple.com

Prevention

  • Set up automatic certificate renewal
  • Monitor certificate expiry (30-day warning)
  • Use cert-manager in Kubernetes
  • Implement certificate expiry alerts
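
A 30-day expiry warning can be built on the `-checkend` option of `openssl x509`, which exits non-zero when a certificate expires within the given number of seconds. The sketch below assumes the certificate is available as a PEM file; `days_to_seconds` and `cert_valid_for` are hypothetical helper names.

```shell
# days_to_seconds DAYS
# Converts a warning window in days to the seconds value -checkend expects.
days_to_seconds() {
    echo $(( $1 * 86400 ))
}

# cert_valid_for PEM_FILE DAYS
# Hypothetical helper: succeeds when the certificate in PEM_FILE is still
# valid DAYS days from now, and fails when it expires within that window.
cert_valid_for() {
    openssl x509 -noout -checkend "$(days_to_seconds "$2")" -in "$1" >/dev/null
}
```

A scheduled job could then run something like `cert_valid_for <path-to-cert.pem> 30 || echo "certificate expires within 30 days"` and forward the message to whatever alerting channel is in use.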

Issue: Configuration Drift

Symptoms

  • Different behavior between environments
  • "Works on my machine" issues
  • Unexpected feature flags enabled/disabled
  • Database schema mismatches

Cause

  1. Manual configuration changes not committed
  2. Environment-specific overrides
  3. Cached configuration not refreshed
  4. Feature flags out of sync

Solution

    # Export current configuration
    docker exec <service> cat /app/config.yaml > current-config.yaml

    # Compare with repository
    diff current-config.yaml config/production.yaml

    # Refresh configuration
    docker exec <service> kill -HUP 1  # Reload config

    # Sync feature flags
    buildkit config sync --env production

    # Reset to known state
    # WARNING: git checkout . discards all uncommitted changes to tracked files
    docker compose down && git checkout . && docker compose up -d

Prevention

  • Use GitOps for all configuration
  • Implement configuration validation in CI/CD
  • Use immutable configuration patterns
  • Document all configuration options
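
Configuration validation in CI/CD can start as a plain drift check that fails the pipeline when a running configuration differs from the committed one. This is a sketch only: `config_drift` is a hypothetical helper, and it compares two files that the caller has already exported.

```shell
# config_drift RUNNING_FILE EXPECTED_FILE
# Hypothetical helper: prints a unified diff and returns non-zero when the
# running configuration differs from the expected (committed) one.
config_drift() {
    if diff -u "$2" "$1"; then
        echo "no drift: $1 matches $2"
        return 0
    fi
    echo "drift detected between $1 and $2" >&2
    return 1
}
```

With the file names from the Solution above, the call would be `config_drift current-config.yaml config/production.yaml`.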

Issue: Build Failures

Symptoms

  • CI/CD pipeline failures
  • Local builds succeed but remote builds fail
  • Dependency resolution errors
  • Docker build cache issues

Cause

  1. Missing build dependencies
  2. Network timeouts during package download
  3. Incompatible package versions
  4. Docker layer cache corruption

Solution

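    # Clear local caches
    # Note: deleting package-lock.json regenerates it and may change resolved versions;
    # run `npm ci` instead to reinstall exactly what the committed lock file specifies
    rm -rf node_modules package-lock.json
    npm install

    # For PHP/Composer (deleting composer.lock has the same effect on resolved versions)
    rm -rf vendor composer.lock
    composer install

    # Docker build without cache
    docker build --no-cache -t <image> .

    # Force a clean working tree on the GitLab CI runner
    # In .gitlab-ci.yml, add:
    #   variables:
    #     GIT_CLEAN_FLAGS: "-ffdx"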

Prevention

  • Pin all dependency versions
  • Use lock files and commit them
  • Implement build caching strategies
  • Set appropriate timeouts for network operations
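
Committed lock files can be enforced with a small pre-build check. The sketch below assumes npm and Composer projects, matching the Solution above; `check_lockfile` is a hypothetical helper and only tests that the files exist on disk, not that they are tracked in Git.

```shell
# check_lockfile DIR
# Hypothetical helper: fails when DIR contains a dependency manifest
# without its matching lock file.
check_lockfile() {
    _dir=$1
    _status=0
    if [ -f "$_dir/package.json" ] && [ ! -f "$_dir/package-lock.json" ]; then
        echo "$_dir: package.json has no package-lock.json" >&2
        _status=1
    fi
    if [ -f "$_dir/composer.json" ] && [ ! -f "$_dir/composer.lock" ]; then
        echo "$_dir: composer.json has no composer.lock" >&2
        _status=1
    fi
    return "$_status"
}
```

Running `check_lockfile . || exit 1` as the first pipeline step stops a build before it resolves unpinned versions.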

Issue: Performance Degradation

Symptoms

  • API response times increasing
  • Database queries timing out
  • High CPU usage
  • Memory pressure alerts

Cause

  1. N+1 query problems
  2. Missing database indexes
  3. Inefficient algorithms
  4. Resource contention

Solution

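    # Profile API endpoints (curl-format.txt is a local file of curl -w format variables)
    curl -w "@curl-format.txt" -o /dev/null -s "http://localhost:3000/api/endpoint"

    # Check slow queries (PostgreSQL; requires the pg_stat_statements extension)
    # On PostgreSQL 12 and earlier the column is total_time instead of total_exec_time
    docker exec postgres psql -c "SELECT * FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;"

    # Monitor resource usage
    docker stats

    # Profile application
    node --prof app.js
    node --prof-process isolate-*.log > processed.txt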

Prevention

  • Implement query analyzers in development
  • Add database indexes proactively
  • Use APM tools (New Relic, Datadog)
  • Set up performance budgets in CI/CD
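
A performance budget in CI/CD can be as small as one timed request compared against a limit. This is a sketch, not a replacement for an APM tool: `within_budget` and `check_endpoint` are hypothetical helpers, and the budget value is chosen by the caller. It relies on curl's `%{time_total}` write-out variable, so no separate format file is needed.

```shell
# within_budget MEASURED_SECONDS BUDGET_SECONDS
# Succeeds when the measured time does not exceed the budget.
# awk does the comparison because POSIX sh has no floating-point arithmetic.
within_budget() {
    awk -v t="$1" -v max="$2" 'BEGIN { exit !(t + 0 <= max + 0) }'
}

# check_endpoint URL BUDGET_SECONDS
# Hypothetical helper: times one request and compares it to the budget.
# Returns 2 when the request itself fails, 1 when it is over budget.
check_endpoint() {
    _elapsed=$(curl -o /dev/null -s -w '%{time_total}' "$1") || return 2
    if within_budget "$_elapsed" "$2"; then
        echo "ok: $1 took ${_elapsed}s (budget ${2}s)"
    else
        echo "slow: $1 took ${_elapsed}s (budget ${2}s)" >&2
        return 1
    fi
}
```

For example, `check_endpoint "http://localhost:3000/api/endpoint" 0.5` fails the step when the endpoint takes longer than half a second. A single request is noisy, so a real budget would likely average several runs.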

