LLM Platform Runbook

Separation of Duties: llm-platform is responsible for the Drupal-based web platform. It does NOT own agent manifests or execution. See Separation of Duties for details.

Overview

  • Purpose: Main Drupal-based web platform for LLM operations, providing content management, user authentication, API gateway, admin interfaces, and integration with AI agent services. Serves as the primary user-facing application.
  • Port: 8080 (HTTP), 443 (HTTPS via ingress)
  • Health endpoint: GET /health or GET /api/health
  • Namespace: llm-platform (Kubernetes)
  • Technology: Drupal 11, PHP 8.3, MySQL/MariaDB, Redis
  • Repository: https://gitlab.com/blueflyio/llm-platform-demo
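A health probe for the two endpoints listed above could be sketched as follows. This is a minimal example, not the platform's own tooling: `BASE_URL`, `is_healthy_status`, and `probe_health` are hypothetical names, and the example assumes any 2xx response means healthy.

```shell
#!/usr/bin/env bash
# Hypothetical default; port 8080 is taken from the overview, the host is an assumption.
BASE_URL="${BASE_URL:-http://localhost:8080}"

# Returns 0 (healthy) for any 2xx status code, 1 otherwise.
is_healthy_status() {
  case "$1" in
    2[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Probes GET /health first, then falls back to GET /api/health.
probe_health() {
  local path code
  for path in /health /api/health; do
    # curl prints 000 as the status code when the connection fails.
    code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "${BASE_URL}${path}")" || code="000"
    if is_healthy_status "$code"; then
      echo "healthy via ${path} (HTTP ${code})"
      return 0
    fi
  done
  echo "unhealthy (last HTTP ${code})"
  return 1
}
```

Call `probe_health` from a shell inside the cluster or behind the ingress; its exit code can gate later steps in a script.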

Dependencies

  • MariaDB/MySQL (port 3306) - Primary database
  • Redis (port 6379) - Cache and session storage
  • Solr (port 8983) - Search indexing (optional)
  • Agent Router (port 3004) - LLM routing
  • Agent Brain (port 3006) - Agent state
  • S3/MinIO - File storage
  • SMTP - Email delivery

Core Components

Component | Port | Description
Drupal Web | 8080 | Main web application
PHP-FPM | 9000 | PHP process manager
Nginx | 80/443 | Web server/reverse proxy
Drush CLI | N/A | Drupal command-line tool
Cron | N/A | Scheduled tasks

Common Issues

Issue 1: White Screen of Death (WSOD)

  • Symptoms:
    • Blank white page
    • No error messages displayed
    • 500 Internal Server Error
  • Cause:
    • PHP fatal error
    • Memory limit exceeded
    • Module conflict
  • Resolution:
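    # Check PHP and Nginx error logs
    tail -f /var/log/php/error.log
    tail -f /var/log/nginx/error.log

    # Check Drupal logs
    drush watchdog:show --severity=error --count=50

    # Enable maintenance mode, then show verbose errors temporarily
    # (ini_set() via drush php:eval only affects that one CLI process, not web requests)
    drush state:set system.maintenance_mode 1
    drush config:set system.logging error_level verbose -y

    # Clear all caches
    drush cr

    # Check memory limit (this reports the CLI value)
    php -i | grep memory_limit

    # Increase memory if needed
    kubectl set env deployment/llm-platform -n llm-platform \
      PHP_MEMORY_LIMIT=512M

    # Rebuild caches, including the service container
    drush cache:rebuild

    # When finished, hide errors again and leave maintenance mode
    drush config:set system.logging error_level hide -y
    drush state:set system.maintenance_mode 0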

Issue 2: Database Connection Failures

  • Symptoms:
    • "PDO connection failed" errors
    • "SQLSTATE[HY000]" errors
    • Site completely down
  • Cause:
    • Database server down
    • Connection credentials incorrect
    • Connection pool exhausted
  • Resolution:
    # Check database connectivity
    drush sql:query "SELECT 1"

    # Check database status
    mysql -h db.local -u drupal -p -e "SHOW STATUS"

    # Verify connection settings
    drush status --fields=db-hostname,db-port,db-name,db-driver

    # Check connection count
    mysql -e "SHOW STATUS LIKE 'Threads_connected'"

    # Kill idle connections
    mysql -e "SHOW PROCESSLIST" | grep Sleep | awk '{print $1}' | xargs -I {} mysql -e "KILL {}"

    # Verify that Drupal can open a connection and run a query
    drush php:eval "\Drupal::database()->query('SELECT 1');"

    # Check database credentials in settings.php
    # (the output includes the password; do not paste it into tickets or chat)
    drush php:eval "print_r(\Drupal::database()->getConnectionOptions());"
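To judge whether the connection pool is close to exhaustion, the connection count can be compared against the server limit. The sketch below is illustrative only: `conn_usage_pct` and `current_conn_usage` are hypothetical helper names, and it assumes the `mysql` client can authenticate without prompting (for example through an option file). The 80% figure in the usage note comes from the Scaling Guidelines table.

```shell
#!/usr/bin/env bash
# Prints used connections as an integer percentage of the limit.
conn_usage_pct() {
  local connected="$1" max="$2"
  if [ "$max" -le 0 ]; then
    echo "max connections must be positive" >&2
    return 1
  fi
  echo $(( connected * 100 / max ))
}

# Reads both values from the server and prints the usage percentage.
# -N skips column names, -B selects tab-separated batch output.
current_conn_usage() {
  local connected max
  connected="$(mysql -N -B -e "SHOW STATUS LIKE 'Threads_connected'" | awk '{print $2}')"
  max="$(mysql -N -B -e "SHOW VARIABLES LIKE 'max_connections'" | awk '{print $2}')"
  conn_usage_pct "$connected" "$max"
}
```

For example, `conn_usage_pct 160 200` prints `80`, which sits exactly at the Database Connections threshold; anything above it suggests scaling the database or adding replicas.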

Issue 3: Cache Issues

  • Symptoms:
    • Stale content displayed
    • Changes not appearing
    • Inconsistent page views
  • Cause:
    • Redis cache stale
    • Varnish cache not cleared
    • Drupal cache tables corrupted
  • Resolution:
    # Clear all Drupal caches
    drush cr

    # Clear specific caches (cache bins are addressed with the "bin" type)
    drush cc render
    drush cc bin page
    drush cc bin menu

    # Clear Redis cache
    # (Redis also holds sessions; FLUSHDB logs users out if both share the same database)
    redis-cli FLUSHDB

    # Check Redis connectivity
    drush php:eval "print_r(\Drupal::service('cache.backend.redis')->get('discovery'));"

    # Clear Varnish cache (if used)
    curl -X PURGE http://varnish.local/

    # Rebuild cache tables (only relevant for bins stored in the database)
    drush sql:query "TRUNCATE TABLE cache_bootstrap"
    drush sql:query "TRUNCATE TABLE cache_config"
    drush sql:query "TRUNCATE TABLE cache_container"
    drush sql:query "TRUNCATE TABLE cache_data"
    drush sql:query "TRUNCATE TABLE cache_default"
    drush sql:query "TRUNCATE TABLE cache_discovery"
    drush sql:query "TRUNCATE TABLE cache_render"
    drush cr

Issue 4: Configuration Import Failures

  • Symptoms:
    • drush cim fails with errors
    • "Configuration ... already exists" errors
    • Missing configuration dependencies
  • Cause:
    • Configuration drift
    • Missing module
    • UUID mismatch
  • Resolution:
    # Check configuration status
    drush cst

    # View configuration diff (shows the diff, then prompts; answer "no" to preview only)
    drush config:import --diff

    # Import specific configuration
    drush config:import --source=../config/sync --partial

    # Skip problematic configs
    # (--skip-modules is a Drush 8 option; verify it exists in your Drush version before relying on it)
    drush cim --skip-modules=problematic_module

    # Full import, including deletion of config that is absent from the source
    drush cim -y --source=../config/sync

    # Fix UUID mismatch
    drush config:set system.site uuid "your-site-uuid" -y

    # Rebuild configuration
    # (config:rebuild is not a core Drush command; confirm that a module on this site provides it)
    drush config:rebuild

    # Export current config for comparison
    drush cex -y

Issue 5: Module Update/Install Failures

  • Symptoms:
    • Update hooks failing
    • Schema update errors
    • Module cannot be enabled
  • Cause:
    • Missing dependencies
    • Database schema out of sync
    • PHP version incompatibility
  • Resolution:
    # Check pending updates
    drush updatedb:status

    # Run database updates
    drush updb -y

    # Check module status
    drush pm:list --status=enabled

    # Enable module with dependencies
    drush pm:install module_name -y

    # Check and apply entity updates
    # (Drupal 8.7+ removed automatic entity updates; these commands may require
    # a contrib module such as devel_entity_updates)
    drush entity:updates
    drush entity-updates -y

    # Check PHP requirements
    composer check-platform-reqs

    # Clear Composer cache and reinstall
    composer clear-cache
    composer install --no-dev --optimize-autoloader

Issue 6: Slow Page Load Performance

  • Symptoms:
    • Pages taking >5s to load
    • High server response time
    • Timeout errors
  • Cause:
    • Unoptimized queries
    • Missing cache
    • External service delays
  • Resolution:
    # Enable query logging
    drush state:set system.logging.slow_query_threshold 1000

    # Check slow queries
    drush watchdog:show --type=php --filter='slow'

    # View performance metrics
    drush php:eval "print_r(\Drupal::service('performance_metrics')->getAll());"

    # Enable Redis cache if not configured
    drush pm:install redis -y

    # Optimize CSS/JS aggregation
    drush config:set system.performance css.preprocess 1 -y
    drush config:set system.performance js.preprocess 1 -y
    drush cr

    # Check external service response times
    drush php:eval "\$start = microtime(true); \Drupal::httpClient()->get('http://agent-router:3004/health'); print microtime(true) - \$start;"

    # Enable page cache
    drush pm:install page_cache -y
    drush pm:install dynamic_page_cache -y

    # Check Opcache status
    php -i | grep opcache

Issue 7: Cron Not Running

  • Symptoms:
    • Scheduled tasks not executing
    • Queue items building up
    • Search index stale
  • Cause:
    • Cron job disabled
    • Cron URL blocked
    • PHP timeout too short
  • Resolution:
    # Check when cron last ran (Unix timestamp)
    drush state:get system.cron_last

    # Run cron manually
    drush cron

    # View cron logs
    drush watchdog:show --type=cron --count=20

    # Check queue status
    drush queue:list

    # Process specific queue
    drush queue:run aggregator_feeds

    # Set cron key if needed
    drush state:set system.cron_key "$(openssl rand -hex 16)"

    # Check cron URL
    curl -I "https://llm-platform.local/cron/$(drush state:get system.cron_key)"

    # Enable automated cron
    drush config:set automated_cron.settings interval 3600 -y
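A staleness check along the lines of the CronStale alert (no cron run for more than 2 hours) might look like the sketch below. `cron_is_stale` and `check_cron` are hypothetical helpers, and the example assumes `drush state:get system.cron_last` prints a bare Unix timestamp.

```shell
#!/usr/bin/env bash
# Returns 0 when the last run is more than max_age seconds old (default 7200 = 2 hours).
cron_is_stale() {
  local last_run="$1" now="$2" max_age="${3:-7200}"
  [ $(( now - last_run )) -gt "$max_age" ]
}

# Compares the last recorded cron run against the current time.
check_cron() {
  local last_run
  last_run="$(drush state:get system.cron_last)"
  if cron_is_stale "$last_run" "$(date +%s)"; then
    echo "cron is stale; last run at ${last_run}"
    return 1
  fi
  echo "cron is fresh"
}
```

Keeping the comparison in its own function makes the threshold easy to test without a running site; `check_cron` is the only part that needs Drush.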

Restart Procedure

# 1. Enable maintenance mode
drush state:set system.maintenance_mode 1

# 2. Clear caches
drush cr

# 3. Rolling restart PHP-FPM
kubectl rollout restart deployment/llm-platform -n llm-platform

# 4. Monitor rollout
kubectl rollout status deployment/llm-platform -n llm-platform

# 5. Run database updates if needed
drush updb -y

# 6. Clear caches again
drush cr

# 7. Disable maintenance mode
drush state:set system.maintenance_mode 0

# 8. Verify health
curl http://llm-platform.local/health
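The eight steps above can be wrapped in a script that stops at the first failing step. This is a sketch under stated assumptions rather than an official tool: `run`, `restart_platform`, and the `DRY_RUN` switch are hypothetical, and the script assumes `drush` and `kubectl` are both available in the shell where it runs.

```shell
#!/usr/bin/env bash
NAMESPACE="llm-platform"
DEPLOYMENT="deployment/llm-platform"

# Runs a command, or only prints it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Executes the restart steps in order; the && chain stops at the first failure,
# which leaves maintenance mode enabled so the site stays offline while you investigate.
restart_platform() {
  run drush state:set system.maintenance_mode 1 &&
  run drush cr &&
  run kubectl rollout restart "$DEPLOYMENT" -n "$NAMESPACE" &&
  run kubectl rollout status "$DEPLOYMENT" -n "$NAMESPACE" &&
  run drush updb -y &&
  run drush cr &&
  run drush state:set system.maintenance_mode 0 &&
  run curl -fsS http://llm-platform.local/health
}
```

Running `DRY_RUN=1 restart_platform` prints the eight commands without executing them, which is a cheap way to review the sequence before a real restart.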

Emergency Restart

# Force kill all pods
kubectl delete pods -n llm-platform -l app=llm-platform --force

# Wait for recovery
kubectl wait --for=condition=ready pod -l app=llm-platform -n llm-platform --timeout=300s

# Clear all caches
drush cr

# Verify site functionality
drush status
curl http://llm-platform.local/

Local Development Restart (DDEV)

# Restart DDEV environment
ddev restart

# Or specific containers
ddev exec supervisorctl restart php-fpm
ddev exec nginx -s reload

# Full rebuild
ddev stop
ddev start

# Run updates after restart
ddev drush updb -y
ddev drush cr

Docker Compose Restart

# Graceful restart
docker compose restart drupal

# Force restart with rebuild
docker compose down drupal
docker compose up -d --build drupal

# View logs
docker compose logs -f drupal

Logs Location

Kubernetes Logs

# Real-time logs
kubectl logs -f deployment/llm-platform -n llm-platform

# PHP-FPM logs
kubectl logs -f deployment/llm-platform -n llm-platform -c php-fpm

# Nginx logs
kubectl logs -f deployment/llm-platform -n llm-platform -c nginx

# Export for analysis
kubectl logs deployment/llm-platform -n llm-platform > platform-logs-$(date +%Y%m%d).txt

Drupal Logs

# Watchdog logs
drush watchdog:show --count=100
drush watchdog:show --severity=error
drush watchdog:show --type=php

# Export watchdog
drush watchdog:show --format=csv > watchdog-export.csv

# Clear old logs
drush watchdog:delete --severity=notice

System Logs

# PHP error log
tail -f /var/log/php/error.log

# Nginx access log
tail -f /var/log/nginx/access.log

# Nginx error log
tail -f /var/log/nginx/error.log

# Cron log
tail -f /var/log/drupal/cron.log

Scaling

Horizontal Scaling

# Scale web replicas
kubectl scale deployment/llm-platform --replicas=5 -n llm-platform

# Enable HPA
kubectl autoscale deployment/llm-platform -n llm-platform \
  --min=2 --max=10 --cpu-percent=70

Vertical Scaling

# Increase resources
kubectl set resources deployment/llm-platform -n llm-platform \
  --limits=cpu=2000m,memory=2Gi \
  --requests=cpu=500m,memory=512Mi

# Increase PHP-FPM workers
kubectl set env deployment/llm-platform -n llm-platform \
  PHP_FPM_PM_MAX_CHILDREN=50 \
  PHP_FPM_PM_START_SERVERS=10

Database Scaling

# Add read replica
kubectl scale statefulset/mariadb-replica --replicas=3 -n llm-platform

# Increase database resources
kubectl set resources statefulset/mariadb -n llm-platform \
  --limits=cpu=4000m,memory=8Gi

Scaling Guidelines

Metric | Threshold | Action
CPU Usage | > 70% | Scale horizontally
Memory Usage | > 80% | Add memory or replica
Response Time P99 | > 3s | Scale, optimize queries
Database Connections | > 80% | Scale database, add replicas
PHP-FPM Queue | > 50 | Add workers or replicas
Redis Memory | > 80% | Increase memory, purge
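The guideline table can be encoded as a small lookup so that scripts and on-call checks apply the same thresholds. The sketch below is illustrative: `scaling_action` and its metric keys are hypothetical names, values are integers, and response time is expressed in milliseconds (3000 for the 3s threshold).

```shell
#!/usr/bin/env bash
# Prints the suggested action when the value exceeds the table threshold, otherwise "OK".
scaling_action() {
  local metric="$1" value="$2" threshold action
  case "$metric" in
    cpu)            threshold=70;   action="Scale horizontally" ;;
    memory)         threshold=80;   action="Add memory or replica" ;;
    p99_ms)         threshold=3000; action="Scale, optimize queries" ;;
    db_connections) threshold=80;   action="Scale database, add replicas" ;;
    fpm_queue)      threshold=50;   action="Add workers or replicas" ;;
    redis_memory)   threshold=80;   action="Increase memory, purge" ;;
    *) echo "unknown metric: $metric" >&2; return 2 ;;
  esac
  if [ "$value" -gt "$threshold" ]; then
    echo "$action"
  else
    echo "OK"
  fi
}
```

For instance, `scaling_action cpu 75` prints `Scale horizontally`, while `scaling_action cpu 70` prints `OK` because the table uses strict greater-than comparisons.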

Alerts

Critical Alerts (PagerDuty)

Alert | Condition | Runbook Action
PlatformDown | 0 healthy pods for 2min | Emergency Restart
DatabaseDown | MariaDB unreachable 5min | Check database, failover
StorageFull | Disk >95% full | Expand storage, cleanup
HighErrorRate | >10% 5xx errors | Investigate logs

Warning Alerts (Slack)

Alert | Condition | Runbook Action
HighLatency | P99 > 5s for 5min | Scale, optimize
CacheDown | Redis unreachable | Restart Redis
CronStale | No cron run >2 hours | Check cron, run manually
QueueBacklog | >1000 items | Scale consumers
LowDiskSpace | >80% used | Cleanup, expand

Prometheus Alert Rules

groups:
  - name: llm-platform
    rules:
      - alert: LLMPlatformDown
        expr: up{job="llm-platform"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "LLM Platform is down"
          runbook_url: "https://gitlab.com/blueflyio/agent-platform/technical-docs/-/wikis/runbooks/llm-platform"

      - alert: DatabaseConnectionIssues
        expr: drupal_database_available == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Database connection issues"

      - alert: HighResponseTime
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{job="llm-platform"}[5m])) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time"

      - alert: HighErrorRate
        # Both sides are aggregated with sum(); without it each 5xx series
        # is divided by itself and the ratio is always 1.
        expr: |
          sum(rate(http_requests_total{job="llm-platform",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="llm-platform"}[5m])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate"

      - alert: CronNotRunning
        expr: time() - drupal_cron_last_run_timestamp > 7200
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cron has not run in 2 hours"

Monitoring Dashboards

  • Grafana - LLM Platform: https://grafana.local/d/llm-platform
  • PHP-FPM Status: http://llm-platform.local/fpm-status
  • Drupal Status: drush status
  • Database Monitoring: https://grafana.local/d/mariadb

Drush Command Reference

Cache Management

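# Rebuild all caches
drush cr

# Clear specific caches (cache bins are addressed with the "bin" type)
drush cc render
drush cc bin page
drush cc bin menu
drush cc bin discovery
drush cc bin config

# View cache statistics
# (cache:stats is not a core Drush command; confirm that a module on this site provides it)
drush cache:stats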

Configuration Management

# Export configuration
drush cex -y

# Import configuration
drush cim -y

# View configuration differences
# (config:import --diff shows the diff, then prompts; answer "no" to preview only)
drush cst
drush config:import --diff

# Set configuration value
drush config:set system.site name "LLM Platform" -y

# Get configuration value
drush config:get system.site name

Database Management

# Database status
drush sql:query "SHOW STATUS"

# Run SQL query
drush sql:query "SELECT COUNT(*) FROM users"

# Backup database
drush sql:dump > backup-$(date +%Y%m%d).sql

# Restore database
drush sql:cli < backup.sql

# Sanitize database
drush sql:sanitize -y

User Management

# Create user
drush user:create admin --mail="admin@example.com" --password="secure123"

# Reset password
drush user:password admin "new_password"

# Assign role
drush user:role:add administrator admin

# Block user
drush user:block spammer

# Generate a one-time login link for a user
drush uli --name=admin

Module Management

# List modules
drush pm:list --status=enabled

# Install module
drush pm:install module_name -y

# Uninstall module
drush pm:uninstall module_name -y

# Check for pending security updates
drush pm:security

Maintenance

# Enable maintenance mode
drush state:set system.maintenance_mode 1

# Disable maintenance mode
drush state:set system.maintenance_mode 0

# Run database updates
drush updb -y

# Run cron
drush cron

# Clear queue items
drush queue:delete queue_name

Contacts