Workflow Engine Runbook

MIGRATED: This runbook has been moved to the workflow-engine project wiki.

For project-specific runbooks: See workflow-engine.wiki/runbooks

For ecosystem patterns: See Workflow Engine Architecture in this wiki

Separation of Duties: See Separation of Duties. workflow-engine is responsible for workflow orchestration; it does NOT own agent manifests or execution.

Overview

  • Purpose: Workflow orchestration service managing multi-step agent workflows, task scheduling, parallel execution, and workflow state management. Coordinates complex agent interactions and business processes.
  • Port: 3003
  • Health endpoint: GET /health or GET /api/v1/health
  • Namespace: agents (Kubernetes)
  • Technology: Python/Celery or Node.js/Bull
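
Because either health path may be the one exposed by a given build, a probe can try both in order. The following is a minimal sketch rather than the service's own tooling: the function name check_health, the 5-second limit, and the assumption that any 2xx response means "healthy" are illustrative choices, not documented behavior.

```shell
# Probe the two documented health paths in order; succeed on the first 2xx.
# $1 = base URL (defaults to the local address used throughout this runbook).
check_health() {
  base_url="${1:-http://localhost:3003}"
  for path in /health /api/v1/health; do
    # -f: treat HTTP >= 400 as failure, -s: quiet,
    # --max-time: do not hang on an unresponsive pod
    if curl -fs --max-time 5 -o /dev/null "${base_url}${path}"; then
      echo "healthy via ${path}"
      return 0
    fi
  done
  echo "unhealthy: no health endpoint responded" >&2
  return 1
}
```

Usage: run `check_health` for the local address, or `check_health http://<host>:3003` for another target; the exit status can gate later steps in a script.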

Dependencies

  • Redis (port 6379) - Task queue and workflow state
  • PostgreSQL (port 5432) - Workflow definitions and history
  • Agent Mesh (port 3000) - Agent coordination
  • Agent Brain (port 3001) - Agent task execution

Common Issues

Issue 1: Workflow Stuck in Pending State

  • Symptoms:
    • Workflows not progressing
    • "pending" state for extended periods
    • No worker pickup visible in logs
  • Cause:
    • No available workers
    • Redis queue connection lost
    • Task serialization failure
  • Resolution:
    # Check worker status
    curl http://localhost:3003/api/v1/workers

    # Check Redis queue depth
    redis-cli -h localhost -p 6379 llen "workflow:tasks:pending"

    # Check for dead workers
    curl http://localhost:3003/api/v1/workers/dead

    # Restart workers
    kubectl rollout restart deployment/workflow-workers -n agents

    # Manually retry stuck workflow
    curl -X POST http://localhost:3003/api/v1/workflows/{workflow_id}/retry
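
The checks above can be folded into a small triage helper that maps two observations to the most likely cause from the list. This is a heuristic sketch only: the function name diagnose_pending and its labels are hypothetical, and the "not-enqueued" branch is merely a hint to look at serialization, not a confirmed diagnosis.

```shell
# Map observations to the most likely cause of a stuck "pending" workflow.
# $1 = number of live workers
# $2 = pending queue depth (pass an empty string if Redis did not answer)
diagnose_pending() {
  workers="$1"
  depth="$2"
  if [ -z "$depth" ]; then
    echo "redis-unreachable"       # queue connection lost
  elif [ "$workers" -eq 0 ]; then
    echo "no-workers"              # restart or scale workflow-workers
  elif [ "$depth" -eq 0 ]; then
    echo "not-enqueued"            # task never reached the queue; check serialization
  else
    echo "workers-not-consuming"   # workers exist but the queue is not draining
  fi
}
```

Usage: feed it the worker count and the output of the `llen` command, for example `diagnose_pending 3 "$(redis-cli -h localhost -p 6379 llen workflow:tasks:pending)"`. How to derive the worker count depends on the response shape of /api/v1/workers, which this runbook does not specify.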

Issue 2: Task Timeout

  • Symptoms:
    • Tasks failing with timeout errors
    • Partial workflow completion
    • "Task exceeded maximum runtime" in logs
  • Cause:
    • Long-running LLM calls
    • Downstream service slow
    • Timeout configured too low
  • Resolution:
    # Check task execution times
    curl http://localhost:3003/api/v1/tasks/stats | jq '.avg_duration'

    # Identify slow tasks (URL quoted so the shell does not treat "?" as a glob)
    curl "http://localhost:3003/api/v1/tasks/slow?threshold=60s"

    # Increase timeout for specific task type
    curl -X PATCH http://localhost:3003/api/v1/task-types/llm_inference \
      -H "Content-Type: application/json" \
      -d '{"timeout": 300}'

    # Retry timed-out task with extended timeout
    curl -X POST "http://localhost:3003/api/v1/tasks/{task_id}/retry?timeout=600"

Issue 3: Parallel Execution Deadlock

  • Symptoms:
    • Multiple tasks waiting for each other
    • Workflow progress halted
    • Circular dependency detected warnings
  • Cause:
    • Poorly designed workflow DAG
    • Resource contention
    • Lock not released
  • Resolution:
    # Export workflow DAG for inspection
    curl http://localhost:3003/api/v1/workflows/{workflow_id}/dag > dag.json

    # Check for locks (--scan iterates incrementally; KEYS would block Redis on a large keyspace)
    redis-cli -h localhost -p 6379 --scan --pattern "workflow:lock:*"

    # Force release locks (use with caution)
    redis-cli -h localhost -p 6379 del "workflow:lock:{workflow_id}"

    # Cancel deadlocked workflow
    curl -X POST http://localhost:3003/api/v1/workflows/{workflow_id}/cancel

    # Enable deadlock detection
    kubectl set env deployment/workflow-engine -n agents DEADLOCK_DETECTION=true
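
A circular dependency in a workflow definition can be checked offline with the standard tsort utility, which complains on stderr when its input is not a DAG. The sketch below assumes the edges have already been extracted as "dependency step" pairs, one per line; the function name dag_has_cycle is hypothetical, and the format of the /dag response is not documented here, so the extraction step is left out.

```shell
# Read "dependency step" pairs on stdin, one edge per line.
# Returns 0 (true) if tsort reports a problem, 1 otherwise.
# Note: any tsort diagnostic counts, so malformed input (for example an
# odd number of tokens) is also reported as a failure.
dag_has_cycle() {
  # Capture stderr only; the sorted order on stdout is discarded.
  tsort_err="$(tsort 2>&1 >/dev/null)"
  [ -n "$tsort_err" ]
}
```

Usage: `printf 'fetch process\nprocess store\n' | dag_has_cycle` returns non-zero for a valid chain, while `printf 'a b\nb a\n' | dag_has_cycle` returns zero. For a definition shaped like the Create Workflow payload (steps with id and depends_on), a jq filter such as `jq -r '.steps[] | .id as $s | (.depends_on // [])[] | "\(.) \($s)"'` is one way to produce the pairs.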

Issue 4: Worker Memory Exhaustion

  • Symptoms:
    • OOMKilled events
    • Workers restarting frequently
    • Memory climbing over time
  • Cause:
    • Large payloads in tasks
    • Memory leaks in task handlers
    • Too many concurrent tasks per worker
  • Resolution:
    # Check worker memory
    kubectl top pods -n agents -l app=workflow-workers

    # Reduce concurrency
    kubectl set env deployment/workflow-workers -n agents WORKER_CONCURRENCY=2

    # Enable memory profiling
    kubectl set env deployment/workflow-workers -n agents MEMORY_PROFILER=true

    # Restart workers with new limits
    kubectl set resources deployment/workflow-workers -n agents \
      --limits=memory=2Gi --requests=memory=512Mi

Issue 5: Workflow Versioning Conflict

  • Symptoms:
    • Old workflow version executing
    • "Workflow definition not found" errors
    • Inconsistent behavior between runs
  • Cause:
    • Workflow definition updated mid-execution
    • Cache serving stale definition
    • Database replication lag
  • Resolution:
    # Check active workflow versions
    curl http://localhost:3003/api/v1/workflows/versions

    # Clear workflow definition cache
    # (DEL does not expand wildcards, so list matching keys with --scan first)
    redis-cli -h localhost -p 6379 --scan --pattern "workflow:definitions:*" \
      | xargs -r redis-cli -h localhost -p 6379 del

    # Force reload definitions
    curl -X POST http://localhost:3003/api/v1/workflows/reload

    # Pin specific version for execution
    curl -X POST "http://localhost:3003/api/v1/workflows/{workflow_id}/pin?version=2"

Restart Procedure

# 1. Stop accepting new workflows
curl -X POST http://localhost:3003/api/v1/pause

# 2. Wait for running workflows to complete (or timeout after 5min)
timeout 300 bash -c 'while [ $(curl -s http://localhost:3003/api/v1/workflows/running | jq length) -gt 0 ]; do sleep 10; done'

# 3. Rolling restart
kubectl rollout restart deployment/workflow-engine -n agents
kubectl rollout restart deployment/workflow-workers -n agents

# 4. Wait for ready
kubectl rollout status deployment/workflow-engine -n agents
kubectl rollout status deployment/workflow-workers -n agents

# 5. Resume workflow processing
curl -X POST http://localhost:3003/api/v1/resume

# 6. Verify health
curl http://localhost:3003/health
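
The one-line wait in step 2 treats a failed probe the same as "nothing running", because an empty result ends the loop. A more explicit variant is sketched below; the names wait_for_drain and running_count are hypothetical, and it assumes, as step 2 does, that /api/v1/workflows/running returns a JSON array.

```shell
# Poll a command that prints the number of running workflows until it
# prints 0 or the deadline passes.
# $1 = command name, $2 = deadline in seconds, $3 = poll interval in seconds
wait_for_drain() {
  count_cmd="$1"
  deadline="${2:-300}"
  interval="${3:-10}"
  waited=0
  while :; do
    running="$($count_cmd 2>/dev/null)"
    # Only a literal 0 means drained; an empty or failed probe is "unknown".
    case "$running" in
      0) echo "drained after ${waited}s"; return 0 ;;
    esac
    if [ "$waited" -ge "$deadline" ]; then
      echo "timed out with ${running:-unknown} workflows running" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

Usage: define `running_count() { curl -s http://localhost:3003/api/v1/workflows/running | jq length; }` and call `wait_for_drain running_count 300 10` between the pause and the rolling restart.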

Emergency Restart

# Force restart (may lose in-progress workflows)
kubectl delete pods -n agents -l app=workflow-engine --force
kubectl delete pods -n agents -l app=workflow-workers --force

# Wait for recovery
kubectl wait --for=condition=ready pod -l app=workflow-engine -n agents --timeout=120s

# Check for orphaned workflows
curl http://localhost:3003/api/v1/workflows/orphaned

# Recover or cancel orphaned workflows
curl -X POST http://localhost:3003/api/v1/workflows/recover-orphaned

Local Development Restart

# Docker Compose
docker compose restart workflow-engine workflow-workers

# Python/Celery
pkill -f "celery.*workflow" && celery -A workflow worker -l info

# Node.js/Bull
pkill -f "workflow-engine" && npm run start:workflow

Logs Location

Kubernetes Logs

# Engine logs
kubectl logs -f deployment/workflow-engine -n agents

# Worker logs
kubectl logs -f deployment/workflow-workers -n agents

# All workflow-related logs
kubectl logs -l app.kubernetes.io/component=workflow -n agents --all-containers

# Filter by workflow ID
kubectl logs deployment/workflow-engine -n agents | grep "workflow_id=abc123"

Local Logs

# Application logs
tail -f /var/log/workflow-engine/app.log

# Celery worker logs
tail -f /var/log/workflow-engine/worker.log

# Task execution logs
tail -f /var/log/workflow-engine/tasks.log

Workflow Tracing

# Get workflow execution trace
curl http://localhost:3003/api/v1/workflows/{workflow_id}/trace

# Export to JSON for analysis
curl http://localhost:3003/api/v1/workflows/{workflow_id}/trace > trace.json

Scaling

Worker Scaling

# Scale workers for more parallel tasks
kubectl scale deployment/workflow-workers --replicas=10 -n agents

# Enable HPA
kubectl autoscale deployment/workflow-workers -n agents \
  --min=2 --max=20 --cpu-percent=70

Engine Scaling

# Scale engine for more workflow coordination
kubectl scale deployment/workflow-engine --replicas=3 -n agents

Queue-Based Scaling

# Scale based on queue depth (KEDA ScaledObject; requires KEDA to be installed in the cluster)
kubectl apply -f - <<EOF
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: workflow-workers-scaler
  namespace: agents
spec:
  scaleTargetRef:
    name: workflow-workers
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
    - type: redis
      metadata:
        address: redis:6379
        listName: workflow:tasks:pending
        listLength: "10"
EOF
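
To get a feel for what the settings above imply, the replica count can be approximated as the queue depth divided by the per-replica target (listLength), rounded up and clamped to the min/max bounds. This is a back-of-the-envelope sketch, not the autoscaler's exact algorithm: the real HPA also applies tolerance and stabilization windows, and the function name desired_replicas is hypothetical.

```shell
# Approximate replica count: ceil(queue_depth / target_per_replica),
# clamped to [min, max]. Defaults mirror the ScaledObject above.
# $1 = queue depth, $2 = target per replica, $3 = min, $4 = max
desired_replicas() {
  depth="$1"
  target="${2:-10}"
  min="${3:-2}"
  max="${4:-20}"
  # Integer ceiling division
  want=$(( (depth + target - 1) / target ))
  if [ "$want" -lt "$min" ]; then want="$min"; fi
  if [ "$want" -gt "$max" ]; then want="$max"; fi
  echo "$want"
}
```

Usage: `desired_replicas "$(redis-cli -h localhost -p 6379 llen workflow:tasks:pending)"` gives a rough target to compare against the current replica count before scaling by hand.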

Scaling Guidelines

Metric | Threshold | Action
Queue Depth | > 100 pending tasks | Scale workers
Worker CPU | > 70% | Scale workers
Task Completion Rate | < 10/min | Scale workers, investigate
Active Workflows | > 50/engine | Scale engine

Alerts

Critical Alerts (PagerDuty)

Alert | Condition | Runbook Action
EngineDown | 0 healthy pods for 2min | Emergency Restart
AllWorkersDown | 0 healthy workers for 2min | Restart workers
WorkflowStuckCritical | Workflow pending >1hr | Manual intervention
QueueBacklogCritical | >1000 pending tasks | Scale immediately

Warning Alerts (Slack)

Alert | Condition | Runbook Action
HighTaskLatency | Avg task duration >5min | Investigate, scale
WorkerRestarting | >3 restarts in 10min | Check memory, logs
QueueBacklog | >100 pending tasks | Scale workers
WorkflowFailureRate | >10% failure in 1hr | Investigate errors
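
The two queue backlog thresholds from the alert tables (more than 1000 pending tasks is critical, more than 100 is a warning) can be checked by hand during an incident. The helper below is only an illustration of those thresholds; the function name queue_alert_level and its output labels are hypothetical and are not part of the alerting stack.

```shell
# Classify a pending-task count against the QueueBacklog thresholds.
# $1 = number of pending tasks
queue_alert_level() {
  depth="$1"
  if [ "$depth" -gt 1000 ]; then
    echo "critical"   # QueueBacklogCritical: scale immediately
  elif [ "$depth" -gt 100 ]; then
    echo "warning"    # QueueBacklog: scale workers
  else
    echo "ok"
  fi
}
```

Usage: `queue_alert_level "$(redis-cli -h localhost -p 6379 llen workflow:tasks:pending)"`.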

Prometheus Alert Rules

groups:
  - name: workflow-engine
    rules:
      - alert: WorkflowEngineDown
        expr: up{job="workflow-engine"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Workflow Engine is down"
          runbook_url: "https://gitlab.com/blueflyio/agent-platform/technical-docs/-/wikis/runbooks/workflow-engine"

      - alert: WorkflowWorkersDown
        expr: up{job="workflow-workers"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Workflow Workers are down"

      - alert: HighQueueDepth
        expr: workflow_queue_depth > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Workflow queue depth high"

      - alert: WorkflowStuck
        expr: workflow_pending_duration_seconds > 3600
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Workflow stuck in pending state"

Monitoring Dashboards

  • Grafana: https://grafana.local/d/workflow-engine
  • Task Queue Dashboard: https://grafana.local/d/workflow-queue
  • Flower (Celery): http://localhost:5555 (if using Celery)
  • Bull Dashboard: http://localhost:3003/admin/queues (if using Bull)

Workflow Management Commands

Create Workflow

curl -X POST http://localhost:3003/api/v1/workflows \
  -H "Content-Type: application/json" \
  -d '{
    "name": "data-pipeline",
    "steps": [
      {"id": "fetch", "type": "http", "url": "..."},
      {"id": "process", "type": "agent", "agent": "processor", "depends_on": ["fetch"]},
      {"id": "store", "type": "database", "depends_on": ["process"]}
    ]
  }'

Cancel Workflow

curl -X POST http://localhost:3003/api/v1/workflows/{workflow_id}/cancel

Retry Failed Workflow

curl -X POST http://localhost:3003/api/v1/workflows/{workflow_id}/retry

Contacts

  • On-call: PagerDuty rotation
  • Slack: #platform-incidents
  • Owner: Platform Team