Common Platform Issues
General troubleshooting guide for the BlueFly.io Agent Platform.
Issue: Service Not Starting
Symptoms
- Service container exits immediately after starting
- Health check endpoints return 503
- Logs show "connection refused" or "port already in use"
- OrbStack shows container in restart loop
Cause
- Port conflict with another service
- Missing environment variables
- Dependency service not ready
- Insufficient memory allocation
- Configuration file syntax errors
Solution
    # Check port conflicts
    lsof -i :3000  # Replace with service port

    # Verify environment variables
    docker exec <container> env | grep -E "^(DATABASE|REDIS|API)"

    # Check service logs
    docker logs <container> --tail 100

    # Verify dependencies are running
    docker ps | grep -E "(postgres|redis|qdrant)"

    # Restart with fresh state
    docker compose down && docker compose up -d
Prevention
- Use port mapping validation in CI/CD
- Document all required environment variables
- Implement proper health checks with retry logic
- Use Docker Compose depends_on with condition: service_healthy
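
The retry logic recommended above can be sketched as a small polling helper. This is a minimal sketch rather than the platform's actual health check: the function name `wait_until`, and the `/health` path and port 3000 in the usage comment, are assumptions made for illustration.

```shell
# wait_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# Returns 0 on the first success, 1 if every attempt failed.
wait_until() {
  wu_max=$1
  wu_delay=$2
  shift 2
  wu_attempt=1
  while ! "$@"; do
    if [ "$wu_attempt" -ge "$wu_max" ]; then
      echo "gave up after $wu_attempt attempts: $*" >&2
      return 1
    fi
    wu_attempt=$((wu_attempt + 1))
    sleep "$wu_delay"
  done
  return 0
}

# Hypothetical usage (endpoint path and port are assumptions):
#   wait_until 10 3 curl -fsS http://localhost:3000/health
```

A fixed delay keeps startup checks predictable; for calls to services that may be overloaded, an increasing delay is usually the gentler choice.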
Issue: Memory Exhaustion
Symptoms
- Services killed unexpectedly (OOMKilled)
- Slow response times across platform
- Mac system becomes unresponsive
- OrbStack shows high memory usage
Cause
- Memory leaks in long-running services
- Too many services running simultaneously
- Large dataset processing without streaming
- Unbounded caching
Solution
    # Check memory usage by container
    docker stats --no-stream

    # Identify OOMKilled containers
    docker inspect <container> | jq '.[0].State.OOMKilled'

    # Increase container memory limits
    # In docker-compose.yml, under the service definition:
    #   deploy:
    #     resources:
    #       limits:
    #         memory: 2G
    #       reservations:
    #         memory: 512M

    # Clear Docker cache (deletes unused images and volumes; back up data first)
    docker system prune -a --volumes

    # Restart OrbStack to reclaim memory
    orb restart
Prevention
- Set memory limits for all containers
- Implement streaming for large data operations
- Use bounded caches with TTL
- Monitor memory usage with Prometheus/Grafana
- Schedule regular container restarts for leaky services
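
Until Prometheus/Grafana monitoring is in place, a quick threshold check over `docker stats` output can flag containers that are close to their limit. The sketch below is illustrative only: the function name `flag_high_memory` and the 80% threshold in the usage comment are assumptions, not platform settings.

```shell
# flag_high_memory THRESHOLD_PERCENT
# Reads "name percent%" lines on stdin and prints the names whose
# memory percentage is strictly above the threshold.
flag_high_memory() {
  awk -v limit="$1" '{
    pct = $2
    sub(/%$/, "", pct)          # strip the trailing percent sign
    if (pct + 0 > limit + 0) print $1
  }'
}

# Usage against live containers:
#   docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' | flag_high_memory 80
```

Reading from stdin keeps the filter independent of Docker, so it can be tried against saved output as well.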
Issue: Network Connectivity Failures
Symptoms
- Services cannot reach each other
- DNS resolution fails within containers
- External API calls timeout
- Tailscale connectivity issues
Cause
- Docker network not created
- DNS configuration issues
- Firewall blocking traffic
- Tailscale not connected
- VPN conflicts
Solution
    # Verify Docker networks
    docker network ls
    docker network inspect blueflyio_default

    # Test DNS resolution
    docker exec <container> nslookup postgres
    docker exec <container> nslookup api.openai.com

    # Check Tailscale status
    tailscale status
    tailscale ping 100.108.129.7   # Mac M4
    tailscale ping 100.108.180.36  # Mac M3

    # Restart Docker networking
    docker network prune
    docker compose up -d

    # Reset Tailscale
    tailscale down && tailscale up
Prevention
- Use service names instead of IP addresses
- Implement connection retry logic with exponential backoff
- Monitor network health with ping checks
- Document network dependencies in service manifests
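
Connection retry with exponential backoff, as recommended above, could look like the following. It is a sketch under stated assumptions: the function names `backoff_delay` and `retry_with_backoff` are hypothetical, the delay doubles per attempt starting from a base value and is capped, and no jitter is applied.

```shell
# backoff_delay ATTEMPT BASE_SECONDS CAP_SECONDS
# Prints the delay before retry number ATTEMPT (1-based):
# BASE * 2^(ATTEMPT-1), capped at CAP.
backoff_delay() {
  bd_delay=$2
  bd_i=1
  while [ "$bd_i" -lt "$1" ] && [ "$bd_delay" -lt "$3" ]; do
    bd_delay=$((bd_delay * 2))
    bd_i=$((bd_i + 1))
  done
  if [ "$bd_delay" -gt "$3" ]; then
    bd_delay=$3
  fi
  echo "$bd_delay"
}

# retry_with_backoff MAX_ATTEMPTS BASE_SECONDS CAP_SECONDS COMMAND [ARGS...]
# Returns 0 on the first success, 1 once MAX_ATTEMPTS have failed.
retry_with_backoff() {
  rb_max=$1
  rb_base=$2
  rb_cap=$3
  shift 3
  rb_attempt=1
  while ! "$@"; do
    if [ "$rb_attempt" -ge "$rb_max" ]; then
      return 1
    fi
    sleep "$(backoff_delay "$rb_attempt" "$rb_base" "$rb_cap")"
    rb_attempt=$((rb_attempt + 1))
  done
  return 0
}

# Example: up to 5 attempts, waiting 1s, 2s, 4s, 8s between them:
#   retry_with_backoff 5 1 30 docker exec <container> nslookup postgres
```

The cap keeps a long outage from producing multi-minute sleeps; adding random jitter would be a reasonable extension when many services retry at once.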
Issue: Disk Space Exhaustion
Symptoms
- Build failures with "no space left on device"
- Database writes failing
- Log files not rotating
- Container creation fails
Cause
- Docker images/volumes accumulating
- Log files not rotated
- Build artifacts not cleaned
- Database WAL files growing
Solution
    # Check disk usage
    df -h

    # Docker cleanup (deletes unused images and volumes; back up data first)
    docker system prune -a --volumes
    docker volume prune

    # Clean build artifacts
    rm -rf node_modules/.cache
    rm -rf .next/cache
    rm -rf vendor/cache

    # PostgreSQL WAL cleanup
    # (add -U <role> if psql reports that the default role does not exist)
    docker exec postgres psql -c "SELECT pg_current_wal_lsn();"
    # Verify replication is caught up, then:
    docker exec postgres psql -c "CHECKPOINT;"

    # Clear old logs (older than 7 days)
    find /var/log -name "*.log" -mtime +7 -delete
Prevention
- Implement log rotation (logrotate or Docker logging driver)
- Schedule weekly Docker cleanup jobs
- Set WAL retention policies in PostgreSQL
- Monitor disk usage with alerts at 80% threshold
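
The 80% alert threshold mentioned above can be approximated with a small check suitable for a cron job. This is a sketch, not the platform's monitoring setup: the function names `usage_percent` and `disk_over_threshold` are hypothetical, and the parser assumes the POSIX `df -P` column layout with no spaces in the filesystem name.

```shell
# usage_percent: reads `df -P` output on stdin and prints the
# Capacity column of the first filesystem row as a bare number.
usage_percent() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# disk_over_threshold PATH THRESHOLD_PERCENT
# Returns 0 (and prints a warning) when the filesystem holding PATH
# is at or above the threshold, 1 otherwise.
disk_over_threshold() {
  dt_used=$(df -P "$1" | usage_percent)
  if [ "$dt_used" -ge "$2" ]; then
    echo "WARNING: $1 is ${dt_used}% full (threshold ${2}%)"
    return 0
  fi
  return 1
}

# Example: warn when the root filesystem reaches 80%:
#   disk_over_threshold / 80
```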
Issue: SSL/TLS Certificate Errors
Symptoms
- "Certificate has expired" errors
- "Unable to verify certificate" warnings
- HTTPS connections failing
- Browser security warnings
Cause
- Let's Encrypt certificates expired
- Self-signed certificates not trusted
- Certificate chain incomplete
- System clock drift
Solution
    # Check certificate expiry
    echo | openssl s_client -connect localhost:443 2>/dev/null | openssl x509 -noout -dates

    # Renew Let's Encrypt certificates (dry run first)
    certbot renew --dry-run
    certbot renew

    # Verify certificate chain (echo closes stdin so the command does not hang)
    echo | openssl s_client -connect localhost:443 -showcerts

    # Fix system clock
    sudo sntp -sS time.apple.com
Prevention
- Set up automatic certificate renewal
- Monitor certificate expiry (30 days warning)
- Use cert-manager in Kubernetes
- Implement certificate expiry alerts
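
A 30-day expiry warning can be built on `openssl x509 -checkend`, which tests whether a certificate expires within a given number of seconds. The sketch below is one possible shape for such a check; the function names `days_to_seconds` and `cert_expires_within` are hypothetical, and the host in the usage comment simply reuses the `localhost:443` target from the commands above.

```shell
# days_to_seconds DAYS: converts a warning window to seconds.
days_to_seconds() {
  echo $(($1 * 86400))
}

# cert_expires_within HOST PORT DAYS
# Returns 0 if the certificate served at HOST:PORT expires (or has
# already expired) within DAYS days, 1 if it stays valid beyond that,
# and 2 if the TLS connection could not be made.
cert_expires_within() {
  ce_pem=$(openssl s_client -connect "$1:$2" -servername "$1" </dev/null 2>/dev/null) || return 2
  if echo "$ce_pem" | openssl x509 -noout -checkend "$(days_to_seconds "$3")" >/dev/null 2>&1; then
    return 1   # still valid beyond the window
  fi
  return 0     # expiring within the window (or the certificate could not be parsed)
}

# Example:
#   cert_expires_within localhost 443 30 && echo "renew soon"
```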
Issue: Configuration Drift
Symptoms
- Different behavior between environments
- "Works on my machine" issues
- Unexpected feature flags enabled/disabled
- Database schema mismatches
Cause
- Manual configuration changes not committed
- Environment-specific overrides
- Cached configuration not refreshed
- Feature flags out of sync
Solution
    # Export current configuration
    docker exec <service> cat /app/config.yaml > current-config.yaml

    # Compare with repository
    diff current-config.yaml config/production.yaml

    # Refresh configuration
    docker exec <service> kill -HUP 1  # Reload config

    # Sync feature flags
    buildkit config sync --env production

    # Reset to known state (git checkout . discards uncommitted local changes)
    docker compose down && git checkout . && docker compose up -d
Prevention
- Use GitOps for all configuration
- Implement configuration validation in CI/CD
- Use immutable configuration patterns
- Document all configuration options
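
Configuration validation in CI/CD can start as a simple drift check that fails the job when a running configuration differs from the committed one. This is a minimal sketch: the function name `report_drift` is hypothetical, and the file names in the usage comment follow the Solution commands above.

```shell
# report_drift RUNNING_CONFIG REPO_CONFIG
# Prints a unified diff (repository version first) and returns 1 when
# the files differ or cannot be compared, 0 when they are identical.
report_drift() {
  if diff -u "$2" "$1"; then
    return 0
  fi
  echo "configuration drift detected: $1 differs from $2" >&2
  return 1
}

# Example:
#   report_drift current-config.yaml config/production.yaml
```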
Issue: Build Failures
Symptoms
- CI/CD pipeline failures
- Local builds succeed but remote fails
- Dependency resolution errors
- Docker build cache issues
Cause
- Missing build dependencies
- Network timeouts during package download
- Incompatible package versions
- Docker layer cache corruption
Solution
    # Clear local caches (this regenerates the lock file; commit the result)
    rm -rf node_modules package-lock.json
    npm install

    # For PHP/Composer
    rm -rf vendor composer.lock
    composer install

    # Docker build without cache
    docker build --no-cache -t <image> .

    # Check GitLab CI runner resources
    # In .gitlab-ci.yml, add:
    #   variables:
    #     GIT_CLEAN_FLAGS: "-ffdx"
Prevention
- Pin all dependency versions
- Use lock files and commit them
- Implement build caching strategies
- Set appropriate timeouts for network operations
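
To illustrate the version-pinning advice, the sketch below lists `package.json` entries that use a `^` or `~` range instead of an exact version. It is a crude text scan rather than a JSON parser, and the function name `find_unpinned` is an assumption made for this example.

```shell
# find_unpinned: reads package.json on stdin and prints the lines whose
# version string starts with ^ or ~ (a range, not an exact pin).
# Prints nothing when every version is pinned.
find_unpinned() {
  grep -E '"[^"]+"[[:space:]]*:[[:space:]]*"[~^]' || true
}

# Example:
#   find_unpinned < package.json
```

In CI, a non-empty result could be treated as a failure; the lock file still remains the authoritative record of what gets installed.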
Issue: Performance Degradation
Symptoms
- API response times increasing
- Database queries timing out
- High CPU usage
- Memory pressure alerts
Cause
- N+1 query problems
- Missing database indexes
- Inefficient algorithms
- Resource contention
Solution
    # Profile API endpoints
    curl -w "@curl-format.txt" -o /dev/null -s "http://localhost:3000/api/endpoint"

    # Check slow queries (PostgreSQL; requires the pg_stat_statements extension)
    # On PostgreSQL 13+ the column is total_exec_time; on 12 and earlier it is total_time
    docker exec postgres psql -c "SELECT * FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;"

    # Monitor resource usage
    docker stats

    # Profile application
    node --prof app.js
    node --prof-process isolate-*.log > processed.txt
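
The curl command above reads its output template from `curl-format.txt`, which this guide does not define. One possible template, using curl's standard `--write-out` timing variables, is sketched here; the helper name `write_curl_format` and the choice of fields are assumptions.

```shell
# write_curl_format PATH: writes a curl --write-out template to PATH.
# Each %{...} name is a built-in curl timing variable (seconds).
write_curl_format() {
  cat > "$1" <<'EOF'
time_namelookup:    %{time_namelookup}s\n
time_connect:       %{time_connect}s\n
time_appconnect:    %{time_appconnect}s\n
time_starttransfer: %{time_starttransfer}s\n
time_total:         %{time_total}s\n
EOF
}

# Example:
#   write_curl_format curl-format.txt
```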
Prevention
- Implement query analyzers in development
- Add database indexes proactively
- Use APM tools (New Relic, Datadog)
- Set up performance budgets in CI/CD
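
A performance budget in CI/CD can be as simple as timing an endpoint and failing the job when it exceeds a limit. The following is a sketch under stated assumptions: the function names `within_budget` and `check_endpoint` are hypothetical, and the 0.5 second budget in the usage comment is an example value, not a platform target.

```shell
# within_budget MEASURED_SECONDS BUDGET_SECONDS
# Returns 0 when MEASURED <= BUDGET; awk handles the decimal comparison.
within_budget() {
  awk -v t="$1" -v b="$2" 'BEGIN { exit !(t + 0 <= b + 0) }'
}

# check_endpoint URL BUDGET_SECONDS
# Returns 0 when the request finishes within budget, 1 when it is
# slower, and 2 when curl itself fails.
check_endpoint() {
  pb_time=$(curl -o /dev/null -s -w '%{time_total}' "$1") || return 2
  if within_budget "$pb_time" "$2"; then
    echo "OK $1 ${pb_time}s"
  else
    echo "SLOW $1 ${pb_time}s (budget ${2}s)"
    return 1
  fi
}

# Example:
#   check_endpoint "http://localhost:3000/api/endpoint" 0.5
```

A single request is a noisy measurement; averaging several runs before comparing against the budget would make the check less flaky.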