Common Platform Issues

General troubleshooting guide for the BlueFly.io Agent Platform.

Issue: Service Not Starting

Symptoms

Service container exits immediately after starting
Health check endpoints return 503
Logs show "connection refused" or "port already in use"
OrbStack shows container in restart loop

Cause

Port conflict with another service
Missing environment variables
Dependency service not ready
Insufficient memory allocation
Configuration file syntax errors

Solution

# Check port conflicts
lsof -i :3000  # Replace with service port

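# Verify environment variables (docker inspect also works when the container has already exited)
docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' <container> | grep -E "^(DATABASE|REDIS|API)"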

# Check service logs
docker logs <container> --tail 100

# Verify dependencies are running
docker ps | grep -E "(postgres|redis|qdrant)"

# Restart with fresh state
docker compose down && docker compose up -d
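
The environment check above can also run inside the container, before the service process starts. The following is a minimal sketch for a POSIX shell entrypoint. The function name require_env and the variable names in the usage comment are hypothetical and not part of the platform.

```shell
# Hypothetical helper: report every required environment variable that is
# unset or empty. Returns 0 when all are present, 1 otherwise.
require_env() {
  local name missing=0
  for name in "$@"; do
    # printenv only sees exported variables, which is what a child process inherits
    if [ -z "$(printenv "$name")" ]; then
      echo "missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (variable names are placeholders):
#   require_env DATABASE_URL REDIS_URL || exit 1
```

Reporting every missing variable in one pass, rather than stopping at the first, avoids a fix-and-restart cycle per variable.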

Prevention

Validate port mappings in CI/CD
Document all required environment variables
Implement proper health checks with retry logic
Use Docker Compose depends_on with condition: service_healthy
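
A health check with retry logic can be as small as a loop around the probe command. This is a sketch under stated assumptions: the function name retry is illustrative, and the /health URL in the usage comment is a hypothetical endpoint, not one confirmed by this guide.

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds, waiting DELAY_SECONDS between attempts.
# Returns 1 once MAX_ATTEMPTS attempts have failed.
retry() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "gave up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$(( attempt + 1 ))
    sleep "$delay"
  done
}

# Usage (hypothetical endpoint):
#   retry 10 3 curl -fsS http://localhost:3000/health
```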

Issue: Memory Exhaustion

Symptoms

Services killed unexpectedly (OOMKilled)
Slow response times across platform
Mac system becomes unresponsive
OrbStack shows high memory usage

Cause

Memory leaks in long-running services
Too many services running simultaneously
Large dataset processing without streaming
Unbounded caching

Solution

# Check memory usage by container
docker stats --no-stream

# Identify OOMKilled containers
docker inspect <container> | jq '.[0].State.OOMKilled'

# Increase container memory limits
# In docker-compose.yml, under the service definition:
deploy:
  resources:
    limits:
      memory: 2G
    reservations:
      memory: 512M

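# Clear Docker cache
# Destructive: removes stopped containers, all unused images, build cache, and unused volumes.
# Back up any volume data you still need before running this.
docker system prune -a --volumes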

# Restart OrbStack to reclaim memory
orb restart
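
To pick out the containers worth investigating, the output of docker stats can be filtered by memory percentage. This is a sketch: the function name over_mem_limit and the 80 percent limit in the usage comment are assumptions chosen for illustration.

```shell
# Hypothetical filter: read "NAME PERCENT%" lines on stdin and print the names
# whose memory percentage is strictly above the limit given as $1.
over_mem_limit() {
  awk -v limit="$1" '{
    pct = $2
    sub(/%$/, "", pct)   # strip the trailing percent sign
    if (pct + 0 > limit + 0) print $1
  }'
}

# Usage (requires a running Docker daemon):
#   docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' | over_mem_limit 80
```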

Prevention

Set memory limits for all containers
Implement streaming for large data operations
Use bounded caches with TTL
Monitor memory usage with Prometheus/Grafana
Schedule regular container restarts for leaky services

Issue: Network Connectivity Failures

Symptoms

Services cannot reach each other
DNS resolution fails within containers
External API calls time out
Tailscale connectivity issues

Cause

Docker network not created
DNS configuration issues
Firewall blocking traffic
Tailscale not connected
VPN conflicts

Solution

# Verify Docker networks
docker network ls
docker network inspect blueflyio_default

# Test DNS resolution
docker exec <container> nslookup postgres
docker exec <container> nslookup api.openai.com

# Check Tailscale status
tailscale status
tailscale ping 100.108.129.7  # Mac M4
tailscale ping 100.108.180.36 # Mac M3

# Recreate Docker networks (prune removes only networks that no container is using)
docker network prune
docker compose up -d

# Reset Tailscale
tailscale down && tailscale up

Prevention

Use service names instead of IP addresses
Implement connection retry logic with exponential backoff
Monitor network health with ping checks
Document network dependencies in service manifests
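
Exponential backoff doubles the wait after each failed attempt, up to a cap, so that a recovering service is not flooded with reconnects. The sketch below assumes a POSIX shell. The function names and the example parameters are illustrative, not platform conventions.

```shell
# Delay in seconds before retry number $1 (1-based): base * 2^(attempt-1), capped at $3.
backoff_delay() {
  local attempt="$1" base="$2" cap="$3"
  local delay=$(( base * (1 << (attempt - 1)) ))
  if [ "$delay" -gt "$cap" ]; then
    delay="$cap"
  fi
  echo "$delay"
}

# retry_backoff MAX_ATTEMPTS BASE_SECONDS CAP_SECONDS COMMAND [ARGS...]
retry_backoff() {
  local max_attempts="$1" base="$2" cap="$3" attempt=1
  shift 3
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    sleep "$(backoff_delay "$attempt" "$base" "$cap")"
    attempt=$(( attempt + 1 ))
  done
}

# Usage: up to 6 attempts, waiting 1, 2, 4, 8, 16 seconds between them
#   retry_backoff 6 1 30 docker exec <container> nslookup postgres
```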

Issue: Disk Space Exhaustion

Symptoms

Build failures with "no space left on device"
Database writes failing
Log files not rotating
Container creation fails

Cause

Docker images/volumes accumulating
Log files not rotated
Build artifacts not cleaned
Database WAL files growing

Solution

# Check disk usage
df -h

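# Docker cleanup
# Destructive: removes stopped containers, all unused images, build cache, and unused volumes.
# Make sure database volumes are attached to a container or backed up first.
docker system prune -a --volumes
docker volume prune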

# Clean build artifacts
rm -rf node_modules/.cache
rm -rf .next/cache
rm -rf vendor/cache

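# PostgreSQL WAL cleanup (adjust -U if your superuser is not "postgres")
docker exec postgres psql -U postgres -c "SELECT pg_current_wal_lsn();"
# Verify replication is caught up, then force a checkpoint so old WAL segments can be recycled:
docker exec postgres psql -U postgres -c "CHECKPOINT;"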

# Clear old logs
find /var/log -name "*.log" -mtime +7 -delete

Prevention

Implement log rotation (logrotate or Docker logging driver)
Schedule weekly Docker cleanup jobs
Set WAL retention policies in PostgreSQL
Monitor disk usage with alerts at 80% threshold
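
The 80% threshold can be checked from a cron job or CI step by filtering df output. This is a sketch: the function name disk_over is hypothetical, and it assumes filesystem names without spaces.

```shell
# Hypothetical filter: read `df -P` output on stdin and print "MOUNTPOINT USE%"
# for every filesystem at or above the percentage given as $1.
disk_over() {
  awk -v limit="$1" 'NR > 1 {
    used = $5
    sub(/%$/, "", used)   # strip the trailing percent sign
    if (used + 0 >= limit + 0) print $6, $5
  }'
}

# Usage:
#   df -P | disk_over 80
```

The -P flag requests the POSIX output format, which keeps each filesystem on one line and the columns in a fixed order.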

Issue: SSL/TLS Certificate Errors

Symptoms

"Certificate has expired" errors
"Unable to verify certificate" warnings
HTTPS connections failing
Browser security warnings

Cause

Let's Encrypt certificates expired
Self-signed certificates not trusted
Certificate chain incomplete
System clock drift

Solution

# Check certificate expiry
echo | openssl s_client -connect localhost:443 2>/dev/null | openssl x509 -noout -dates

# Renew Let's Encrypt certificates
certbot renew --dry-run
certbot renew

# Verify certificate chain (stdin is redirected so the command exits instead of waiting for input)
openssl s_client -connect localhost:443 -showcerts </dev/null

# Fix system clock
sudo sntp -sS time.apple.com

Prevention

Set up automatic certificate renewal
Monitor certificate expiry (warn 30 days before expiration)
Use cert-manager in Kubernetes
Implement certificate expiry alerts
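
An expiry alert can be built on the -checkend option of openssl x509, which tests whether a certificate is still valid a given number of seconds from now. The sketch below works on a PEM file. The function name cert_expires_within and the path in the usage comment are hypothetical.

```shell
# Hypothetical check: return 0 if the certificate in PEM file $1 expires within
# $2 days, 1 if it stays valid for longer than that.
# An unreadable file also returns 0, so a broken setup raises an alert.
cert_expires_within() {
  local pem_file="$1" days="$2"
  # -checkend exits 0 when the certificate is still valid N seconds from now
  if openssl x509 -in "$pem_file" -noout -checkend "$(( days * 86400 ))" >/dev/null 2>&1; then
    return 1
  fi
  return 0
}

# Usage (path is a placeholder):
#   cert_expires_within /path/to/fullchain.pem 30 && echo "renew soon"
```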

Issue: Configuration Drift

Symptoms

Different behavior between environments
"Works on my machine" issues
Unexpected feature flags enabled/disabled
Database schema mismatches

Cause

Manual configuration changes not committed
Environment-specific overrides
Cached configuration not refreshed
Feature flags out of sync

Solution

# Export current configuration
docker exec <service> cat /app/config.yaml > current-config.yaml

# Compare with repository
diff current-config.yaml config/production.yaml

# Refresh configuration (only works if the service reloads its config on SIGHUP)
docker kill --signal=HUP <service>

# Sync feature flags
buildkit config sync --env production

# Reset to known state
# Warning: git checkout . discards all uncommitted changes to tracked files
docker compose down && git checkout . && docker compose up -d

Prevention

Use GitOps for all configuration
Implement configuration validation in CI/CD
Use immutable configuration patterns
Document all configuration options
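
Configuration validation in CI/CD can start as a plain drift gate that fails the job when the running configuration differs from the committed one. This is a sketch: the function name check_drift is hypothetical, and the file names in the usage comment follow the commands earlier in this section.

```shell
# Hypothetical CI gate: succeed only when the two files are identical.
# $1 = file committed in the repository, $2 = file exported from the running service
check_drift() {
  if diff -u "$1" "$2"; then
    echo "no drift"
  else
    echo "configuration drift detected" >&2
    return 1
  fi
}

# Usage:
#   check_drift config/production.yaml current-config.yaml || exit 1
```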

Issue: Build Failures

Symptoms

CI/CD pipeline failures
Local builds succeed but remote builds fail
Dependency resolution errors
Docker build cache issues

Cause

Missing build dependencies
Network timeouts during package download
Incompatible package versions
Docker layer cache corruption

Solution

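# Clear local caches and reinstall from the committed lock file
rm -rf node_modules
npm ci
# Only if the lock file itself is broken: delete package-lock.json, run npm install, and commit the result

# For PHP/Composer (composer install honors composer.lock)
rm -rf vendor
composer install
# Only if composer.lock is broken: delete it, run composer update, and commit the result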

# Docker build without cache
docker build --no-cache -t <image> .

# Force a clean working tree on the GitLab CI runner
# In .gitlab-ci.yml, add:
variables:
  GIT_CLEAN_FLAGS: "-ffdx"

Prevention

Pin all dependency versions
Use lock files and commit them
Implement build caching strategies
Set appropriate timeouts for network operations
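
Whether lock files are actually present can be checked mechanically before a build. The sketch below covers the npm case only. The function name missing_lockfiles is hypothetical.

```shell
# Hypothetical check: print every directory under $1 (default: current directory)
# that has a package.json but no package-lock.json. Prints nothing when all is well.
missing_lockfiles() {
  find "${1:-.}" -name node_modules -prune -o -name package.json -print |
  while IFS= read -r manifest; do
    dir=$(dirname "$manifest")
    [ -f "$dir/package-lock.json" ] || echo "$dir"
  done
}

# Usage in CI: fail the job when anything is printed
#   [ -z "$(missing_lockfiles .)" ] || exit 1
```

Pruning node_modules keeps the search from descending into installed packages, which ship their own package.json files.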

Issue: Performance Degradation

Symptoms

API response times increasing
Database queries timing out
High CPU usage
Memory pressure alerts

Cause

N+1 query problems
Missing database indexes
Inefficient algorithms
Resource contention

Solution

# Profile API endpoints
curl -w "@curl-format.txt" -o /dev/null -s "http://localhost:3000/api/endpoint"

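# Check slow queries (PostgreSQL, requires the pg_stat_statements extension)
# On PostgreSQL 12 and older the column is total_time instead of total_exec_time
docker exec postgres psql -U postgres -c "SELECT * FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;"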

# Monitor resource usage
docker stats

# Profile application
node --prof app.js
node --prof-process isolate-*.log > processed.txt
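
The curl command above reads its output template from curl-format.txt, which has to exist in the working directory. The following is one possible way to create it. The function name write_curl_format and the selection of timing fields are assumptions, though the %{...} variables themselves are standard curl write-out variables.

```shell
# Hypothetical helper: write a curl -w template to $1 (default: curl-format.txt).
# curl strips real newlines when it reads the file, so each line ends in a literal \n.
write_curl_format() {
  cat > "${1:-curl-format.txt}" <<'EOF'
    dns lookup: %{time_namelookup}s\n
   tcp connect: %{time_connect}s\n
 tls handshake: %{time_appconnect}s\n
    first byte: %{time_starttransfer}s\n
         total: %{time_total}s\n
EOF
}

# Usage:
#   write_curl_format
#   curl -w "@curl-format.txt" -o /dev/null -s "http://localhost:3000/api/endpoint"
```

A large gap between tcp connect and first byte points at server-side processing, while a slow dns lookup or tls handshake points at the network layer.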

Prevention

Use query analyzers (for example EXPLAIN ANALYZE) in development
Add database indexes proactively
Use APM tools (New Relic, Datadog)
Set up performance budgets in CI/CD
