Workflow Engine Runbook
**MIGRATED**: This runbook has moved to the workflow-engine project wiki.
For project-specific runbooks, see workflow-engine.wiki/runbooks.
For ecosystem patterns, see Workflow Engine Architecture in this wiki.
Separation of duties: workflow-engine is responsible for workflow orchestration. It does NOT own agent manifests or execution. See Separation of Duties for details.
Overview
- Purpose: Workflow orchestration service managing multi-step agent workflows, task scheduling, parallel execution, and workflow state management. Coordinates complex agent interactions and business processes.
- Port: 3003
- Health endpoint: `GET /health` or `GET /api/v1/health`
- Namespace: `agents` (Kubernetes)
- Technology: Python/Celery or Node.js/Bull
Dependencies
- Redis (port 6379) - Task queue and workflow state
- PostgreSQL (port 5432) - Workflow definitions and history
- Agent Mesh (port 3000) - Agent coordination
- Agent Brain (port 3001) - Agent task execution
Common Issues
Issue 1: Workflow Stuck in Pending State
- Symptoms:
- Workflows not progressing
- "pending" state for extended periods
- No worker pickup visible in logs
- Cause:
- No available workers
- Redis queue connection lost
- Task serialization failure
- Resolution:

      # Check worker status
      curl http://localhost:3003/api/v1/workers
      # Check Redis queue depth
      redis-cli -h localhost -p 6379 llen "workflow:tasks:pending"
      # Check for dead workers
      curl http://localhost:3003/api/v1/workers/dead
      # Restart workers
      kubectl rollout restart deployment/workflow-workers -n agents
      # Manually retry stuck workflow
      curl -X POST http://localhost:3003/api/v1/workflows/{workflow_id}/retry
Issue 2: Task Timeout
- Symptoms:
- Tasks failing with timeout errors
- Partial workflow completion
- "Task exceeded maximum runtime" in logs
- Cause:
- Long-running LLM calls
- Slow downstream service
- Timeout configured too low
- Resolution:

      # Check task execution times
      curl http://localhost:3003/api/v1/tasks/stats | jq '.avg_duration'
      # Identify slow tasks (URL quoted so the shell does not treat "?" as a glob)
      curl "http://localhost:3003/api/v1/tasks/slow?threshold=60s"
      # Increase timeout for specific task type
      curl -X PATCH http://localhost:3003/api/v1/task-types/llm_inference \
        -H "Content-Type: application/json" \
        -d '{"timeout": 300}'
      # Retry timed-out task with extended timeout
      curl -X POST "http://localhost:3003/api/v1/tasks/{task_id}/retry?timeout=600"
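If the slow-task endpoint is unavailable, the same check can be approximated offline. The sketch below assumes a hypothetical export format of one `task_id duration_seconds` pair per line; the function name `slow_tasks` and that format are illustrative and not part of the workflow-engine API.

```shell
# slow_tasks THRESHOLD_SECONDS
# Reads "task_id duration_seconds" lines on stdin (hypothetical export format)
# and prints the tasks whose duration is strictly above the threshold.
slow_tasks() {
  awk -v limit="$1" 'NF >= 2 && ($2 + 0) > (limit + 0) { print $1, $2 }'
}
```

For example, `slow_tasks 60 < durations.txt` lists tasks that ran longer than the 60s threshold used above.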
Issue 3: Parallel Execution Deadlock
- Symptoms:
- Multiple tasks waiting for each other
- Workflow progress halted
- Circular dependency detected warnings
- Cause:
- Poorly designed workflow DAG
- Resource contention
- Lock not released
- Resolution:

      # Visualize workflow DAG
      curl http://localhost:3003/api/v1/workflows/{workflow_id}/dag > dag.json
      # Check for locks (SCAN avoids blocking Redis the way KEYS can on large keyspaces)
      redis-cli -h localhost -p 6379 --scan --pattern "workflow:lock:*"
      # Force release locks (use with caution)
      redis-cli -h localhost -p 6379 del "workflow:lock:{workflow_id}"
      # Cancel deadlocked workflow
      curl -X POST http://localhost:3003/api/v1/workflows/{workflow_id}/cancel
      # Enable deadlock detection
      kubectl set env deployment/workflow-engine -n agents DEADLOCK_DETECTION=true
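A circular dependency can also be confirmed offline once the DAG edges are available as text. The following is a minimal sketch, not the engine's own deadlock detection: it assumes the edges have been flattened to one `step dependency` pair per line, and `has_cycle` is a hypothetical helper name.

```shell
# has_cycle: reads "step dependency" pairs on stdin, one edge per line.
# Exit status 0 means a cycle was found; every step that is part of the
# cycle or blocked behind it is printed. Exit status 1 means no cycle.
has_cycle() {
  awk '
    NF == 2 {
      nodes[$1]; nodes[$2]
      dependents[$2, ++fanout[$2]] = $1   # edge: dependency -> dependent step
      unmet[$1]++                          # count of unmet dependencies
    }
    END {
      total = 0; ready = 0; removed = 0
      for (v in nodes) {
        total++
        if (unmet[v] + 0 == 0) queue[++ready] = v
      }
      # Kahn-style pass: repeatedly remove steps whose dependencies are met
      for (head = 1; head <= ready; head++) {
        v = queue[head]; removed++
        for (i = 1; i <= fanout[v]; i++) {
          w = dependents[v, i]
          if (--unmet[w] == 0) queue[++ready] = w
        }
      }
      if (removed == total) exit 1
      for (v in nodes) if (unmet[v] > 0) print v
      exit 0
    }'
}
```

If the exported `dag.json` happens to use the same `steps` / `depends_on` shape as the Create Workflow payload (an assumption; the export schema is not documented here), the edges could be produced with `jq -r '.steps[] | .id as $s | (.depends_on // [])[] | "\($s) \(.)"' dag.json | has_cycle`.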
Issue 4: Worker Memory Exhaustion
- Symptoms:
- OOMKilled events
- Workers restarting frequently
- Memory climbing over time
- Cause:
- Large payloads in tasks
- Memory leaks in task handlers
- Too many concurrent tasks per worker
- Resolution:

      # Check worker memory
      kubectl top pods -n agents -l app=workflow-workers
      # Reduce concurrency
      kubectl set env deployment/workflow-workers -n agents WORKER_CONCURRENCY=2
      # Enable memory profiling
      kubectl set env deployment/workflow-workers -n agents MEMORY_PROFILER=true
      # Restart workers with new limits
      kubectl set resources deployment/workflow-workers -n agents \
        --limits=memory=2Gi --requests=memory=512Mi
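To spot workers approaching the memory limit before they are OOMKilled, the `kubectl top pods` output can be filtered. This is a sketch under assumptions: `pods_over_memory` is a hypothetical helper, and it expects the usual three-column `NAME CPU(cores) MEMORY(bytes)` layout with `Ki`, `Mi`, or `Gi` suffixes.

```shell
# pods_over_memory LIMIT_MI
# Reads `kubectl top pods` output on stdin and prints each pod whose memory
# usage is at or above LIMIT_MI, normalised to mebibytes.
pods_over_memory() {
  awk -v limit="$1" '
    NR == 1 && $1 == "NAME" { next }         # skip the header row
    {
      mem = $3
      if (mem ~ /Gi$/)      { sub(/Gi$/, "", mem); mem = mem * 1024 }
      else if (mem ~ /Mi$/) { sub(/Mi$/, "", mem); mem = mem + 0 }
      else if (mem ~ /Ki$/) { sub(/Ki$/, "", mem); mem = mem / 1024 }
      else next                                # unknown unit: ignore the row
      if (mem >= limit + 0) printf "%s %dMi\n", $1, mem
    }'
}
```

Usage: `kubectl top pods -n agents -l app=workflow-workers | pods_over_memory 1536`, where 1536 (75% of the 2Gi limit above) is an illustrative threshold.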
Issue 5: Workflow Versioning Conflict
- Symptoms:
- Old workflow version executing
- "Workflow definition not found" errors
- Inconsistent behavior between runs
- Cause:
- Workflow definition updated mid-execution
- Cache serving stale definition
- Database replication lag
- Resolution:

      # Check active workflow versions
      curl http://localhost:3003/api/v1/workflows/versions
      # Clear workflow definition cache (DEL does not expand "*", so scan for matching keys first)
      redis-cli -h localhost -p 6379 --scan --pattern "workflow:definitions:*" \
        | xargs -r redis-cli -h localhost -p 6379 del
      # Force reload definitions
      curl -X POST http://localhost:3003/api/v1/workflows/reload
      # Pin specific version for execution
      curl -X POST "http://localhost:3003/api/v1/workflows/{workflow_id}/pin?version=2"
Restart Procedure
Graceful Restart (Recommended)

    # 1. Stop accepting new workflows
    curl -X POST http://localhost:3003/api/v1/pause
    # 2. Wait for running workflows to complete (or timeout after 5min)
    timeout 300 bash -c 'while [ "$(curl -s http://localhost:3003/api/v1/workflows/running | jq length)" -gt 0 ]; do sleep 10; done'
    # 3. Rolling restart
    kubectl rollout restart deployment/workflow-engine -n agents
    kubectl rollout restart deployment/workflow-workers -n agents
    # 4. Wait for ready
    kubectl rollout status deployment/workflow-engine -n agents
    kubectl rollout status deployment/workflow-workers -n agents
    # 5. Resume workflow processing
    curl -X POST http://localhost:3003/api/v1/resume
    # 6. Verify health
    curl http://localhost:3003/health
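The final health check can fail briefly while pods are still warming up, so a bounded retry is often more useful than a single request. The helper below is a generic sketch; `wait_until` is a hypothetical name and the attempt count and delay are arbitrary choices, not values from the engine.

```shell
# wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# Returns 0 on the first success, 1 if every attempt failed.
wait_until() {
  wu_attempts=$1
  wu_delay=$2
  shift 2
  wu_try=1
  while [ "$wu_try" -le "$wu_attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # pause between attempts, but not after the last one
    if [ "$wu_try" -lt "$wu_attempts" ]; then
      sleep "$wu_delay"
    fi
    wu_try=$((wu_try + 1))
  done
  return 1
}
```

For example, `wait_until 30 10 curl -fsS http://localhost:3003/health` polls the health endpoint for up to about five minutes and fails the step if the engine never becomes healthy.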
Emergency Restart

    # Force restart (may lose in-progress workflows)
    kubectl delete pods -n agents -l app=workflow-engine --force
    kubectl delete pods -n agents -l app=workflow-workers --force
    # Wait for recovery
    kubectl wait --for=condition=ready pod -l app=workflow-engine -n agents --timeout=120s
    # Check for orphaned workflows
    curl http://localhost:3003/api/v1/workflows/orphaned
    # Recover or cancel orphaned workflows
    curl -X POST http://localhost:3003/api/v1/workflows/recover-orphaned
Local Development Restart

    # Docker Compose
    docker compose restart workflow-engine workflow-workers
    # Python/Celery (";" rather than "&&" so the worker starts even if nothing was running)
    pkill -f "celery.*workflow"; celery -A workflow worker -l info
    # Node.js/Bull
    pkill -f "workflow-engine"; npm run start:workflow
Logs Location
Kubernetes Logs

    # Engine logs
    kubectl logs -f deployment/workflow-engine -n agents
    # Worker logs
    kubectl logs -f deployment/workflow-workers -n agents
    # All workflow-related logs
    kubectl logs -l app.kubernetes.io/component=workflow -n agents --all-containers
    # Filter by workflow ID
    kubectl logs deployment/workflow-engine -n agents | grep "workflow_id=abc123"
Local Logs

    # Application logs
    tail -f /var/log/workflow-engine/app.log
    # Celery worker logs
    tail -f /var/log/workflow-engine/worker.log
    # Task execution logs
    tail -f /var/log/workflow-engine/tasks.log
Workflow Tracing

    # Get workflow execution trace
    curl http://localhost:3003/api/v1/workflows/{workflow_id}/trace
    # Export to JSON for analysis
    curl http://localhost:3003/api/v1/workflows/{workflow_id}/trace > trace.json
Scaling
Worker Scaling

    # Scale workers for more parallel tasks
    kubectl scale deployment/workflow-workers --replicas=10 -n agents
    # Enable HPA
    kubectl autoscale deployment/workflow-workers -n agents \
      --min=2 --max=20 --cpu-percent=70
Engine Scaling

    # Scale engine for more workflow coordination
    kubectl scale deployment/workflow-engine --replicas=3 -n agents
Queue-Based Scaling

    # Scale based on queue depth (custom HPA)
    kubectl apply -f - <<EOF
    apiVersion: keda.sh/v1alpha1
    kind: ScaledObject
    metadata:
      name: workflow-workers-scaler
      namespace: agents
    spec:
      scaleTargetRef:
        name: workflow-workers
      minReplicaCount: 2
      maxReplicaCount: 20
      triggers:
        - type: redis
          metadata:
            address: redis:6379
            listName: workflow:tasks:pending
            listLength: "10"
    EOF
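As a rough guide to what this ScaledObject will do, the target replica count is approximately the queue depth divided by `listLength`, rounded up and clamped to the min/max bounds. The sketch below reproduces only that arithmetic; `desired_replicas` is a hypothetical helper, and the real autoscaler also applies tolerance and stabilization windows that are ignored here.

```shell
# desired_replicas QUEUE_DEPTH LIST_LENGTH MIN MAX
# Prints ceil(QUEUE_DEPTH / LIST_LENGTH), clamped to [MIN, MAX].
desired_replicas() {
  dr_depth=$1
  dr_target=$2
  dr_min=$3
  dr_max=$4
  # integer ceiling division
  dr_want=$(( (dr_depth + dr_target - 1) / dr_target ))
  if [ "$dr_want" -lt "$dr_min" ]; then dr_want=$dr_min; fi
  if [ "$dr_want" -gt "$dr_max" ]; then dr_want=$dr_max; fi
  echo "$dr_want"
}
```

With the values above, `desired_replicas 95 10 2 20` gives 10 workers for 95 pending tasks, while a backlog of 1000 is capped at the maximum of 20.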
Scaling Guidelines
| Metric | Threshold | Action |
|---|---|---|
| Queue Depth | > 100 pending tasks | Scale workers |
| Worker CPU | > 70% | Scale workers |
| Task Completion Rate | < 10/min | Scale workers, investigate |
| Active Workflows | > 50/engine | Scale engine |
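The guidelines in this table can be encoded as a small check for use in scripts or on-call tooling. This is an illustrative sketch: `scaling_actions` is a hypothetical helper, it takes whole-number inputs only, and how the four metrics are collected is left to the caller.

```shell
# scaling_actions QUEUE_DEPTH WORKER_CPU_PERCENT TASKS_PER_MIN WORKFLOWS_PER_ENGINE
# Prints one line per breached guideline, or "none" if all are within bounds.
scaling_actions() {
  sa_breached=0
  if [ "$1" -gt 100 ]; then
    echo "scale workers (queue depth $1 > 100)"; sa_breached=1
  fi
  if [ "$2" -gt 70 ]; then
    echo "scale workers (worker CPU $2% > 70%)"; sa_breached=1
  fi
  if [ "$3" -lt 10 ]; then
    echo "scale workers and investigate (completion rate $3/min < 10/min)"; sa_breached=1
  fi
  if [ "$4" -gt 50 ]; then
    echo "scale engine ($4 active workflows per engine > 50)"; sa_breached=1
  fi
  if [ "$sa_breached" -eq 0 ]; then
    echo "none"
  fi
}
```

For example, `scaling_actions 150 40 30 10` reports only the queue-depth breach.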
Alerts
Critical Alerts (PagerDuty)
| Alert | Condition | Runbook Action |
|---|---|---|
| EngineDown | 0 healthy pods for 2min | Emergency Restart |
| AllWorkersDown | 0 healthy workers for 2min | Restart workers |
| WorkflowStuckCritical | Workflow pending >1hr | Manual intervention |
| QueueBacklogCritical | >1000 pending tasks | Scale immediately |
Warning Alerts (Slack)
| Alert | Condition | Runbook Action |
|---|---|---|
| HighTaskLatency | Avg task duration >5min | Investigate, scale |
| WorkerRestarting | >3 restarts in 10min | Check memory, logs |
| QueueBacklog | >100 pending tasks | Scale workers |
| WorkflowFailureRate | >10% failure in 1hr | Investigate errors |
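For alert-handling scripts, the two tables above can be expressed as a lookup from alert name to severity, channel, and runbook action. The function name `alert_action` and the `severity|channel|action` output format are assumptions made for this sketch.

```shell
# alert_action ALERT_NAME
# Prints "severity|channel|runbook action" for a known alert.
# Returns 1 for alert names that are not listed in the tables.
alert_action() {
  case "$1" in
    EngineDown)            echo "critical|PagerDuty|Emergency Restart" ;;
    AllWorkersDown)        echo "critical|PagerDuty|Restart workers" ;;
    WorkflowStuckCritical) echo "critical|PagerDuty|Manual intervention" ;;
    QueueBacklogCritical)  echo "critical|PagerDuty|Scale immediately" ;;
    HighTaskLatency)       echo "warning|Slack|Investigate, scale" ;;
    WorkerRestarting)      echo "warning|Slack|Check memory, logs" ;;
    QueueBacklog)          echo "warning|Slack|Scale workers" ;;
    WorkflowFailureRate)   echo "warning|Slack|Investigate errors" ;;
    *)                     echo "unknown|none|No runbook action defined"; return 1 ;;
  esac
}
```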
Prometheus Alert Rules

    groups:
      - name: workflow-engine
        rules:
          - alert: WorkflowEngineDown
            expr: up{job="workflow-engine"} == 0
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "Workflow Engine is down"
              runbook_url: "https://gitlab.com/blueflyio/agent-platform/technical-docs/-/wikis/runbooks/workflow-engine"
          - alert: WorkflowWorkersDown
            expr: up{job="workflow-workers"} == 0
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "Workflow Workers are down"
          - alert: HighQueueDepth
            expr: workflow_queue_depth > 100
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Workflow queue depth high"
          - alert: WorkflowStuck
            expr: workflow_pending_duration_seconds > 3600
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Workflow stuck in pending state"
Monitoring Dashboards
- Grafana: https://grafana.local/d/workflow-engine
- Task Queue Dashboard: https://grafana.local/d/workflow-queue
- Flower (Celery): http://localhost:5555 (if using Celery)
- Bull Dashboard: http://localhost:3003/admin/queues (if using Bull)
Workflow Management Commands
Create Workflow

    curl -X POST http://localhost:3003/api/v1/workflows \
      -H "Content-Type: application/json" \
      -d '{
        "name": "data-pipeline",
        "steps": [
          {"id": "fetch", "type": "http", "url": "..."},
          {"id": "process", "type": "agent", "agent": "processor", "depends_on": ["fetch"]},
          {"id": "store", "type": "database", "depends_on": ["process"]}
        ]
      }'
Cancel Workflow
curl -X POST http://localhost:3003/api/v1/workflows/{workflow_id}/cancel
Retry Failed Workflow
curl -X POST http://localhost:3003/api/v1/workflows/{workflow_id}/retry
Contacts
- On-call: PagerDuty rotation
- Slack: #platform-incidents
- Owner: Platform Team
Related Runbooks
- Redis Runbook - Task queue backend
- PostgreSQL Runbook - Workflow storage
- Agent Mesh Runbook - Agent coordination
- Agent Brain Runbook - Task execution