Separation of Duties: llm-platform is responsible for the Drupal-based web platform. It does NOT own agent manifests or execution. See Separation of Duties.
Overview
- Purpose: Main Drupal-based web platform for LLM operations, providing content management, user authentication, API gateway, admin interfaces, and integration with AI agent services. Serves as the primary user-facing application.
- Port: 8080 (HTTP), 443 (HTTPS via ingress)
- Health endpoint: GET /health or GET /api/health
- Namespace: llm-platform (Kubernetes)
- Technology: Drupal 11, PHP 8.3, MySQL/MariaDB, Redis
- Repository: https://gitlab.com/blueflyio/llm-platform-demo
Dependencies
- MariaDB/MySQL (port 3306) - Primary database
- Redis (port 6379) - Cache and session storage
- Solr (port 8983) - Search indexing (optional)
- Agent Router (port 3004) - LLM routing
- Agent Brain (port 3006) - Agent state
- S3/MinIO - File storage
- SMTP - Email delivery
Core Components
| Component | Port | Description |
|---|---|---|
| Drupal Web | 8080 | Main web application |
| PHP-FPM | 9000 | PHP process manager |
| Nginx | 80/443 | Web server/reverse proxy |
| Drush CLI | N/A | Drupal command-line tool |
| Cron | N/A | Scheduled tasks |
Common Issues
Issue 1: White Screen of Death (WSOD)
- Symptoms:
- Blank white page
- No error messages displayed
- 500 Internal Server Error
- Cause:
- PHP fatal error
- Memory limit exceeded
- Module conflict
- Resolution:
# Check PHP error logs
tail -f /var/log/php/error.log
tail -f /var/log/nginx/error.log
# Check Drupal logs
drush watchdog:show --severity=error --count=50
# Enable maintenance mode, then show verbose errors temporarily
# (set error_level back to hide once the fault is found)
drush state:set system.maintenance_mode 1
drush config:set system.logging error_level verbose -y
# Clear all caches
drush cr
# Check memory limit
php -i | grep memory_limit
# Increase memory if needed
kubectl set env deployment/llm-platform -n llm-platform \
PHP_MEMORY_LIMIT=512M
# Rebuild caches once the updated pods are ready
drush cache:rebuild
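The memory-limit check above can be scripted so that the value is compared rather than read by eye. The following is a minimal sketch in POSIX shell; the helper names `php_size_to_bytes` and `memory_limit_ok` are hypothetical and not part of the platform, and the 512M target is simply the value used in the step above.

```shell
# Convert a PHP shorthand size ("128M", "1G", "512k", "-1") to bytes.
# Prints -1 for "-1", which PHP treats as unlimited.
php_size_to_bytes() (
  value=$1
  if [ "$value" = "-1" ]; then
    echo -1
    exit 0
  fi
  # Strip one trailing unit letter, if present.
  number=${value%[KkMmGg]}
  case "$number" in
    ''|*[!0-9]*) exit 1 ;;   # reject anything that is not a plain integer
  esac
  case "$value" in
    *[Kk]) echo $((number * 1024)) ;;
    *[Mm]) echo $((number * 1024 * 1024)) ;;
    *[Gg]) echo $((number * 1024 * 1024 * 1024)) ;;
    *)     echo "$number" ;;
  esac
)

# Succeeds (exit 0) when the configured limit ($1) is unlimited
# or at least as large as the required limit ($2).
memory_limit_ok() (
  current=$(php_size_to_bytes "$1") || exit 2
  required=$(php_size_to_bytes "$2") || exit 2
  [ "$current" -eq -1 ] || [ "$current" -ge "$required" ]
)
```

A possible invocation is `memory_limit_ok "$(php -r 'echo ini_get("memory_limit");')" 512M`. Note that the PHP CLI may load a different php.ini than PHP-FPM, so the value seen on the command line is not necessarily the one serving web requests.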
Issue 2: Database Connection Failures
- Symptoms:
- "PDO connection failed" errors
- "SQLSTATE[HY000]" errors
- Site completely down
- Cause:
- Database server down
- Connection credentials incorrect
- Connection pool exhausted
- Resolution:
# Check database connectivity
drush sql:query "SELECT 1"
# Check database status
mysql -h db.local -u drupal -p -e "SHOW STATUS"
# Verify connection settings
drush status --fields=db-hostname,db-port,db-name,db-driver
# Check connection count
mysql -e "SHOW STATUS LIKE 'Threads_connected'"
# Kill idle connections
mysql -e "SHOW PROCESSLIST" | grep Sleep | awk '{print $1}' | xargs -I {} mysql -e "KILL {}"
# Confirm Drupal can open a new database connection
drush php:eval "\Drupal::database()->query('SELECT 1');"
# Print the connection options from settings.php (output includes the password)
drush php:eval "print_r(\Drupal::database()->getConnectionOptions());"
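To judge whether the connection pool is close to exhaustion, the `Threads_connected` count can be compared with `max_connections`. This is a small sketch under stated assumptions: the helper names `status_value` and `conn_usage_pct` are hypothetical, and the `mysql` client is assumed to be configured with working credentials.

```shell
# Read "name<TAB>value" output (as printed by mysql -N for SHOW STATUS or
# SHOW VARIABLES) from stdin and print only the value of the first row.
status_value() {
  awk '{ print $2; exit }'
}

# Integer percentage of used connections: 100 * threads / max.
conn_usage_pct() (
  threads=$1
  max=$2
  [ "$max" -gt 0 ] || exit 1
  echo $((threads * 100 / max))
)

# Example wiring (not executed here):
#   threads=$(mysql -N -e "SHOW STATUS LIKE 'Threads_connected'" | status_value)
#   max=$(mysql -N -e "SHOW VARIABLES LIKE 'max_connections'" | status_value)
#   echo "connection usage: $(conn_usage_pct "$threads" "$max")%"
```

A result above 80 corresponds to the Database Connections threshold in the Scaling Guidelines table.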
Issue 3: Cache Issues
- Symptoms:
- Stale content displayed
- Changes not appearing
- Inconsistent page views
- Cause:
- Redis cache stale
- Varnish cache not cleared
- Drupal cache tables corrupted
- Resolution:
# Clear all Drupal caches
drush cr
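# Clear specific caches (named bins are cleared with "cc bin <name>")
drush cc render
drush cc bin page
drush cc bin menu
# Clear Redis cache (FLUSHDB also removes sessions if they share this Redis database)
redis-cli FLUSHDB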
# Check Redis connectivity
drush php:eval "print_r(\Drupal::service('cache.backend.redis')->get('discovery'));"
# Clear Varnish cache (if used)
curl -X PURGE http://varnish.local/
# Rebuild cache tables
drush sql:query "TRUNCATE TABLE cache_bootstrap"
drush sql:query "TRUNCATE TABLE cache_config"
drush sql:query "TRUNCATE TABLE cache_container"
drush sql:query "TRUNCATE TABLE cache_data"
drush sql:query "TRUNCATE TABLE cache_default"
drush sql:query "TRUNCATE TABLE cache_discovery"
drush sql:query "TRUNCATE TABLE cache_render"
drush cr
Issue 4: Configuration Import Failures
- Symptoms:
- drush cim fails with errors
- "Configuration ... already exists" errors
- Missing configuration dependencies
- Cause:
- Configuration drift
- Missing module
- UUID mismatch
- Resolution:
# Check configuration status
drush cst
# View configuration diff
drush config:diff
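# Partial import: apply only what is in the sync directory, delete nothing
drush config:import --source=../config/sync --partial
# Skip problematic configs (--skip-modules is a Drush 8 option and was removed in Drush 9)
drush cim --skip-modules=problematic_module
# Full import: also deletes active configuration missing from the sync directory
drush cim -y --source=../config/sync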
# Fix UUID mismatch
drush php:eval "\$config = \Drupal::service('config.factory')->getEditable('system.site'); \$config->set('uuid', 'your-site-uuid')->save();"
# Rebuild caches after the import
drush cr
# Export current config for comparison
drush cex -y
Issue 5: Module Update/Install Failures
- Symptoms:
- Update hooks failing
- Schema update errors
- Module cannot be enabled
- Cause:
- Missing dependencies
- Database schema out of sync
- PHP version incompatibility
- Resolution:
# List pending database updates without running them
drush updatedb:status
# Run database updates
drush updb -y
# Check module status
drush pm:list --status=enabled
# Enable module with dependencies
drush pm:install module_name -y
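# Check the status report for mismatched entity/field definitions
drush core:requirements --severity=2
# Apply entity schema changes through update hooks
# (automatic entity updates, "drush entity-updates", were removed in Drupal 8.7)
drush updb -y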
# Check PHP requirements
composer check-platform-reqs
# Clear Composer cache and reinstall
composer clear-cache
composer install --no-dev --optimize-autoloader
Issue 6: Slow Page Load Performance
- Symptoms:
- Pages taking >5s to load
- High server response time
- Timeout errors
- Cause:
- Unoptimized queries
- Missing cache
- External service delays
- Resolution:
# Enable query logging
drush state:set system.logging.slow_query_threshold 1000
# Check slow queries
drush watchdog:show --type=php --filter='slow'
# View performance metrics
drush php:eval "print_r(\Drupal::service('performance_metrics')->getAll());"
# Enable Redis cache if not configured
drush pm:install redis -y
# Optimize CSS/JS aggregation
drush config:set system.performance css.preprocess 1 -y
drush config:set system.performance js.preprocess 1 -y
drush cr
# Check external service response times
drush php:eval "\$start = microtime(true); \Drupal::httpClient()->get('http://agent-router:3004/health'); print microtime(true) - \$start;"
# Enable page cache
drush pm:install page_cache -y
drush pm:install dynamic_page_cache -y
# Check Opcache status
php -i | grep opcache
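The 5-second symptom threshold can be checked from the command line with curl's timing output. The sketch below assumes curl is installed; `request_time` and `exceeds_threshold` are hypothetical helper names, and any URL passed in is supplied by the operator.

```shell
# Print the total request time in seconds for a URL, using curl's
# built-in %{time_total} write-out variable. The response body is discarded.
request_time() {
  curl -o /dev/null -s -w '%{time_total}' "$1"
}

# Succeeds (exit 0) when $1 (seconds, may be fractional) is greater than $2.
# awk is used because POSIX shell arithmetic is integer-only.
exceeds_threshold() {
  awk -v t="$1" -v limit="$2" 'BEGIN { exit !(t + 0 > limit + 0) }'
}

# Example wiring (not executed here):
#   t=$(request_time "http://llm-platform.local/")
#   if exceeds_threshold "$t" 5; then echo "slow: ${t}s"; fi
```

Running the same check against the health endpoint of each dependency helps separate slow Drupal rendering from slow external services.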
Issue 7: Cron Not Running
- Symptoms:
- Scheduled tasks not executing
- Queue items building up
- Search index stale
- Cause:
- Cron job disabled
- Cron URL blocked
- PHP timeout too short
- Resolution:
# Check when cron last ran (Unix timestamp)
drush state:get system.cron_last
# Run cron manually
drush cron
# View cron logs
drush watchdog:show --type=cron --count=20
# Check queue status
drush queue:list
# Process specific queue
drush queue:run aggregator_feeds
# Set cron key if needed
drush state:set system.cron_key "$(openssl rand -hex 16)"
# Check cron URL
curl -I "https://llm-platform.local/cron/$(drush state:get system.cron_key)"
# Enable automated cron
drush config:set automated_cron.settings interval 3600 -y
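Cron staleness can be computed from the last-run timestamp that Drupal keeps in the `system.cron_last` state key. The following is a minimal sketch; `cron_age` and `cron_is_stale` are hypothetical helper names, and the 7200-second default mirrors the two-hour window used by the CronNotRunning alert.

```shell
# Print the seconds elapsed since the last cron run.
# $1 = last-run Unix timestamp, $2 = current time (defaults to now).
cron_age() (
  last=$1
  now=${2:-$(date +%s)}
  echo $((now - last))
)

# Succeeds (exit 0) when cron is older than the limit.
# $1 = last-run timestamp, $2 = current time (optional), $3 = limit in seconds.
cron_is_stale() (
  age=$(cron_age "$1" "$2") || exit 2
  [ "$age" -gt "${3:-7200}" ]
)

# Example wiring (not executed here):
#   if cron_is_stale "$(drush state:get system.cron_last)"; then
#     drush cron
#   fi
```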
Restart Procedure
Graceful Restart (Recommended)
# 1. Enable maintenance mode
drush state:set system.maintenance_mode 1
# 2. Clear caches
drush cr
# 3. Rolling restart of the deployment (PHP-FPM and Nginx containers)
kubectl rollout restart deployment/llm-platform -n llm-platform
# 4. Monitor rollout
kubectl rollout status deployment/llm-platform -n llm-platform
# 5. Run database updates if needed
drush updb -y
# 6. Clear caches again
drush cr
# 7. Disable maintenance mode
drush state:set system.maintenance_mode 0
# 8. Verify health
curl http://llm-platform.local/health
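A single curl right after a rollout can fail while pods are still warming up, so the final verification step may be easier to trust when it polls. This is a sketch only: `wait_for_health` is a hypothetical helper, and the attempt count and delay defaults are arbitrary assumptions.

```shell
# Poll a health URL until curl reports success (HTTP status below 400)
# or the attempts run out.
# $1 = URL, $2 = max attempts (default 30), $3 = delay in seconds (default 5).
wait_for_health() (
  url=$1
  attempts=${2:-30}
  delay=${3:-5}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors; -sS hides progress but shows errors.
    if curl -fsS -o /dev/null "$url"; then
      echo "healthy after $i attempt(s)"
      exit 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "still unhealthy after $attempts attempt(s)" >&2
  exit 1
)

# Example (not executed here):
#   wait_for_health "http://llm-platform.local/health" 30 5
```

Keeping maintenance mode enabled until this check succeeds would avoid exposing users to a half-started site; whether that ordering suits the platform is a judgment call for the operator.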
Emergency Restart
# Force kill all pods
kubectl delete pods -n llm-platform -l app=llm-platform --force
# Wait for recovery
kubectl wait --for=condition=ready pod -l app=llm-platform -n llm-platform --timeout=300s
# Clear all caches
drush cr
# Verify site functionality
drush status
curl http://llm-platform.local/
Local Development Restart (DDEV)
# Restart DDEV environment
ddev restart
# Or specific containers
ddev exec supervisorctl restart php-fpm
ddev exec nginx -s reload
# Full stop and start
ddev stop
ddev start
# Run updates after restart
ddev drush updb -y
ddev drush cr
Docker Compose Restart
# Graceful restart
docker compose restart drupal
# Force restart with rebuild
docker compose down drupal
docker compose up -d --build drupal
# View logs
docker compose logs -f drupal
Logs Location
Kubernetes Logs
# Real-time logs
kubectl logs -f deployment/llm-platform -n llm-platform
# PHP-FPM logs
kubectl logs -f deployment/llm-platform -n llm-platform -c php-fpm
# Nginx logs
kubectl logs -f deployment/llm-platform -n llm-platform -c nginx
# Export for analysis
kubectl logs deployment/llm-platform -n llm-platform > platform-logs-$(date +%Y%m%d).txt
Drupal Logs
# Watchdog logs
drush watchdog:show --count=100
drush watchdog:show --severity=error
drush watchdog:show --type=php
# Export watchdog
drush watchdog:show --format=csv > watchdog-export.csv
# Delete notice-level log entries
drush watchdog:delete --severity=notice
System Logs
# PHP error log
tail -f /var/log/php/error.log
# Nginx access log
tail -f /var/log/nginx/access.log
# Nginx error log
tail -f /var/log/nginx/error.log
# Cron log
tail -f /var/log/drupal/cron.log
Scaling
Horizontal Scaling
# Scale web replicas
kubectl scale deployment/llm-platform --replicas=5 -n llm-platform
# Enable HPA
kubectl autoscale deployment/llm-platform -n llm-platform \
--min=2 --max=10 --cpu-percent=70
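To anticipate what the autoscaler will do, the replica count can be estimated with the formula documented for the Kubernetes HPA: desired = ceil(current replicas × current metric / target metric). The sketch below applies it with integer arithmetic and clamps to the min/max from the command above. `hpa_desired_replicas` is a hypothetical helper, and it ignores the HPA's tolerance band and stabilization windows.

```shell
# Estimate the replica count the HPA would request.
# $1 = current replicas, $2 = current average CPU %, $3 = target CPU %,
# $4 = min replicas (default 2), $5 = max replicas (default 10).
hpa_desired_replicas() (
  current=$1
  usage=$2
  target=$3
  min=${4:-2}
  max=${5:-10}
  [ "$target" -gt 0 ] || exit 1
  # Integer ceiling of current * usage / target.
  desired=$(( (current * usage + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
)

# Example: 4 replicas averaging 90% CPU against a 70% target
#   hpa_desired_replicas 4 90 70
```

With 4 replicas at 90% CPU and a 70% target, the estimate is ceil(4 × 90 / 70) = 6 replicas.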
Vertical Scaling
# Increase resources
kubectl set resources deployment/llm-platform -n llm-platform \
--limits=cpu=2000m,memory=2Gi \
--requests=cpu=500m,memory=512Mi
# Increase PHP-FPM workers
kubectl set env deployment/llm-platform -n llm-platform \
PHP_FPM_PM_MAX_CHILDREN=50 \
PHP_FPM_PM_START_SERVERS=10
Database Scaling
# Add read replica
kubectl scale statefulset/mariadb-replica --replicas=3 -n llm-platform
# Increase database resources
kubectl set resources statefulset/mariadb -n llm-platform \
--limits=cpu=4000m,memory=8Gi
Scaling Guidelines
| Metric | Threshold | Action |
|---|---|---|
| CPU Usage | > 70% | Scale horizontally |
| Memory Usage | > 80% | Add memory or replica |
| Response Time P99 | > 3s | Scale, optimize queries |
| Database Connections | > 80% | Scale database, add replicas |
| PHP-FPM Queue | > 50 | Add workers or replicas |
| Redis Memory | > 80% | Increase memory, purge |
Alerts
Critical Alerts
| Alert | Condition | Runbook Action |
|---|---|---|
| PlatformDown | 0 healthy pods for 2min | Emergency Restart |
| DatabaseDown | MariaDB unreachable 5min | Check database, failover |
| StorageFull | Disk >95% full | Expand storage, cleanup |
| HighErrorRate | >10% 5xx errors | Investigate logs |
Warning Alerts (Slack)
| Alert | Condition | Runbook Action |
|---|---|---|
| HighLatency | P99 > 5s for 5min | Scale, optimize |
| CacheDown | Redis unreachable | Restart Redis |
| CronStale | No cron run >2 hours | Check cron, run manually |
| QueueBacklog | >1000 items | Scale consumers |
| LowDiskSpace | >80% used | Cleanup, expand |
Prometheus Alert Rules
groups:
  - name: llm-platform
    rules:
      - alert: LLMPlatformDown
        expr: up{job="llm-platform"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "LLM Platform is down"
          runbook_url: "https://gitlab.com/blueflyio/agent-platform/technical-docs/-/wikis/runbooks/llm-platform"
      - alert: DatabaseConnectionIssues
        expr: drupal_database_available == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Database connection issues"
      - alert: HighResponseTime
        expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{job="llm-platform"}[5m]))) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time"
      - alert: HighErrorRate
        # Aggregate both sides with sum(); dividing raw series would match each 5xx series against itself
        expr: sum(rate(http_requests_total{job="llm-platform",status=~"5.."}[5m])) / sum(rate(http_requests_total{job="llm-platform"}[5m])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate"
      - alert: CronNotRunning
        expr: time() - drupal_cron_last_run_timestamp > 7200
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cron has not run in 2 hours"
Monitoring Dashboards
- Grafana - LLM Platform: https://grafana.local/d/llm-platform
- PHP-FPM Status: http://llm-platform.local/fpm-status
- Drupal Status: drush status
- Database Monitoring: https://grafana.local/d/mariadb
Drush Command Reference
Cache Management
# Rebuild all caches
drush cr
# Clear specific caches (named bins are cleared with "cc bin <name>")
drush cc render
drush cc bin page
drush cc bin menu
drush cc bin discovery
drush cc bin config
# View cache statistics
drush cache:stats
Configuration Management
# Export configuration
drush cex -y
# Import configuration
drush cim -y
# View configuration differences
drush cst
drush config:diff
# Set configuration value
drush config:set system.site name "LLM Platform" -y
# Get configuration value
drush config:get system.site name
Database Management
# Database status
drush sql:query "SHOW STATUS"
# Run SQL query
drush sql:query "SELECT COUNT(*) FROM users"
# Backup database
drush sql:dump > backup-$(date +%Y%m%d).sql
# Restore database
drush sql:cli < backup.sql
# Sanitize database
drush sql:sanitize -y
User Management
# Create user
drush user:create admin --mail="admin@example.com" --password="secure123"
# Reset password
drush user:password admin "new_password"
# Assign role
drush user:role:add administrator admin
# Block user
drush user:block spammer
# Login as user
drush uli admin
Module Management
# List modules
drush pm:list --status=enabled
# Install module
drush pm:install module_name -y
# Uninstall module
drush pm:uninstall module_name -y
# Check for security updates
drush pm:security
Maintenance
# Enable maintenance mode
drush state:set system.maintenance_mode 1
# Disable maintenance mode
drush state:set system.maintenance_mode 0
# Run database updates
drush updb -y
# Run cron
drush cron
# Clear queue items
drush queue:delete queue_name
- On-call: PagerDuty rotation
- Slack: #platform-incidents, #drupal-support
- Owner: Platform Team
- Repository: https://gitlab.com/blueflyio/llm-platform-demo