Workflow Engine Runbook

MIGRATED: This runbook has moved to the workflow-engine project wiki.

For project-specific runbooks: See workflow-engine.wiki/runbooks

For ecosystem patterns: See Workflow Engine Architecture in this wiki

Separation of Duties: workflow-engine is responsible for workflow orchestration. It does NOT own agent manifests or agent execution. See Separation of Duties.

Overview

Purpose: Workflow orchestration service managing multi-step agent workflows, task scheduling, parallel execution, and workflow state management. Coordinates complex agent interactions and business processes.
Port: 3003
Health endpoint: GET /health or GET /api/v1/health
Namespace: agents (Kubernetes)
Technology: Python/Celery or Node.js/Bull

Dependencies

Redis (port 6379) - Task queue and workflow state
PostgreSQL (port 5432) - Workflow definitions and history
Agent Mesh (port 3000) - Agent coordination
Agent Brain (port 3001) - Agent task execution

Common Issues

Issue 1: Workflow Stuck in Pending State

Symptoms:
- Workflows not progressing
- "pending" state for extended periods
- No worker pickup visible in logs
Cause:
- No available workers
- Redis queue connection lost
- Task serialization failure

Resolution:

# Check worker status
curl http://localhost:3003/api/v1/workers

# Check Redis queue depth
redis-cli -h localhost -p 6379 llen "workflow:tasks:pending"

# Check for dead workers
curl http://localhost:3003/api/v1/workers/dead

# Restart workers
kubectl rollout restart deployment/workflow-workers -n agents

# Manually retry stuck workflow
curl -X POST http://localhost:3003/api/v1/workflows/{workflow_id}/retry
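
The checks above can be folded into a rough triage rule. The helper below is a sketch, not part of the service: diagnose_pending is a hypothetical name, and the mapping from observations to causes is a heuristic based on the cause list for this issue.

```shell
# Hypothetical helper: map two observations onto the likely cause of a
# workflow stuck in "pending".
#   $1 = number of live workers
#   $2 = length of the workflow:tasks:pending list
diagnose_pending() {
  workers=$1
  depth=$2
  if [ "$workers" -eq 0 ]; then
    echo "no-workers: restart deployment/workflow-workers"
  elif [ "$depth" -eq 0 ]; then
    echo "empty-queue: task never reached Redis; check the Redis connection and serialization errors"
  else
    echo "workers-idle: workers are up but not consuming; check worker logs, then retry the workflow"
  fi
}
```

For example, diagnose_pending 0 12 reports the no-workers case. Feed it the queue depth from the redis-cli llen command above and a worker count taken from the /api/v1/workers response (the shape of that response is not documented here, so how the count is extracted is left to the operator).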

Issue 2: Task Timeout

Symptoms:
- Tasks failing with timeout errors
- Partial workflow completion
- "Task exceeded maximum runtime" in logs
Cause:
- Long-running LLM calls
- Downstream service slow
- Timeout configured too low

Resolution:

# Check task execution times
curl http://localhost:3003/api/v1/tasks/stats | jq '.avg_duration'

# Identify slow tasks
curl "http://localhost:3003/api/v1/tasks/slow?threshold=60s"

# Increase timeout for specific task type
curl -X PATCH http://localhost:3003/api/v1/task-types/llm_inference \
  -H "Content-Type: application/json" \
  -d '{"timeout": 300}'

# Retry timed-out task with extended timeout
curl -X POST "http://localhost:3003/api/v1/tasks/{task_id}/retry?timeout=600"

Issue 3: Parallel Execution Deadlock

Symptoms:
- Multiple tasks waiting for each other
- Workflow progress halted
- Circular dependency detected warnings
Cause:
- Poorly designed workflow DAG
- Resource contention
- Lock not released

Resolution:

# Visualize workflow DAG
curl http://localhost:3003/api/v1/workflows/{workflow_id}/dag > dag.json

# Check for locks (SCAN iterates incrementally; KEYS can block a busy Redis)
redis-cli -h localhost -p 6379 --scan --pattern "workflow:lock:*"

# Force release locks (use with caution)
redis-cli -h localhost -p 6379 del "workflow:lock:{workflow_id}"

# Cancel deadlocked workflow
curl -X POST http://localhost:3003/api/v1/workflows/{workflow_id}/cancel

# Enable deadlock detection
kubectl set env deployment/workflow-engine -n agents DEADLOCK_DETECTION=true
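
A circular dependency in the DAG can also be checked offline before cancelling anything. The sketch below relies on the standard tsort utility, which writes a complaint to stderr when its input graph contains a loop. The function names are hypothetical, and dag_edges assumes the definition has the same shape as the Create Workflow payload in this runbook (steps[].id and steps[].depends_on); the /dag endpoint may return something different.

```shell
# Reads "dependency dependent" pairs on stdin, one edge per line.
# tsort reports a loop on stderr, so any stderr output is treated as
# "cycle found" (malformed input is reported the same way).
has_cycle() {
  [ -n "$(tsort 2>&1 >/dev/null)" ]
}

# Hypothetical converter: prints one "dependency dependent" edge per line.
# Assumes a JSON file with steps[].id and optional steps[].depends_on.
dag_edges() {
  jq -r '.steps[] | .id as $id | (.depends_on // [])[] | "\(.) \($id)"' "$1"
}
```

Usage, under the assumption above: dag_edges dag.json | has_cycle && echo "circular dependency in workflow definition".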

Issue 4: Worker Memory Exhaustion

Symptoms:
- OOMKilled events
- Workers restarting frequently
- Memory climbing over time
Cause:
- Large payloads in tasks
- Memory leaks in task handlers
- Too many concurrent tasks per worker

Resolution:

# Check worker memory
kubectl top pods -n agents -l app=workflow-workers

# Reduce concurrency
kubectl set env deployment/workflow-workers -n agents WORKER_CONCURRENCY=2

# Enable memory profiling
kubectl set env deployment/workflow-workers -n agents MEMORY_PROFILER=true

# Restart workers with new limits
kubectl set resources deployment/workflow-workers -n agents \
  --limits=memory=2Gi --requests=memory=512Mi
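
To spot workers that are close to the limit before they are OOMKilled, the kubectl top output can be filtered by memory. This is a sketch: pods_over_mem is a hypothetical helper, and it assumes the three-column layout (name, CPU, memory) that kubectl top pods prints with --no-headers.

```shell
# Reads `kubectl top pods --no-headers` output on stdin and prints the names
# of pods whose memory usage is at or above the threshold given in Mi.
pods_over_mem() {
  awk -v limit="$1" '
    {
      mem = $3
      if (mem ~ /Gi$/)      { sub(/Gi$/, "", mem); mem = mem * 1024 }
      else if (mem ~ /Mi$/) { sub(/Mi$/, "", mem) }
      else if (mem ~ /Ki$/) { sub(/Ki$/, "", mem); mem = mem / 1024 }
      else next             # skip lines without a recognizable memory unit
      if (mem + 0 >= limit + 0) print $1
    }'
}
```

Usage: kubectl top pods -n agents -l app=workflow-workers --no-headers | pods_over_mem 1800 lists workers using roughly 90% or more of the 2Gi limit set above.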

Issue 5: Workflow Versioning Conflict

Symptoms:
- Old workflow version executing
- "Workflow definition not found" errors
- Inconsistent behavior between runs
Cause:
- Workflow definition updated mid-execution
- Cache serving stale definition
- Database replication lag

Resolution:

# Check active workflow versions
curl http://localhost:3003/api/v1/workflows/versions

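# Clear workflow definition cache (DEL does not expand wildcards, so scan for matching keys first)
redis-cli -h localhost -p 6379 --scan --pattern "workflow:definitions:*" | xargs -r redis-cli -h localhost -p 6379 del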

# Force reload definitions
curl -X POST http://localhost:3003/api/v1/workflows/reload

# Pin specific version for execution
curl -X POST "http://localhost:3003/api/v1/workflows/{workflow_id}/pin?version=2"

Restart Procedure

Graceful Restart (Recommended)

# 1. Stop accepting new workflows
curl -X POST http://localhost:3003/api/v1/pause

# 2. Wait for running workflows to complete (or timeout after 5min)
timeout 300 bash -c 'until [ "$(curl -sf http://localhost:3003/api/v1/workflows/running | jq length)" = "0" ]; do sleep 10; done'

# 3. Rolling restart
kubectl rollout restart deployment/workflow-engine -n agents
kubectl rollout restart deployment/workflow-workers -n agents

# 4. Wait for ready
kubectl rollout status deployment/workflow-engine -n agents
kubectl rollout status deployment/workflow-workers -n agents

# 5. Resume workflow processing
curl -X POST http://localhost:3003/api/v1/resume

# 6. Verify health
curl http://localhost:3003/health
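
The wait in step 2 can also be written as a reusable polling helper, which makes the timeout explicit and keeps waiting when the API is briefly unreachable. This is a sketch: wait_until and no_running_workflows are hypothetical names, and the second function assumes /api/v1/workflows/running returns a JSON array, as the jq length call in step 2 implies.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds. Returns 1 once TIMEOUT_SECONDS of
# sleeping have accumulated (time spent inside COMMAND is not counted).
wait_until() {
  timeout_s=$1
  interval_s=$2
  shift 2
  elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
  return 0
}

# Succeeds only when the API answers and reports zero running workflows.
# An unreachable API yields an empty count, which counts as "not done yet".
no_running_workflows() {
  count=$(curl -sf http://localhost:3003/api/v1/workflows/running | jq length)
  [ "$count" = "0" ]
}
```

Usage: wait_until 300 10 no_running_workflows || echo "timed out after 5min, continuing with restart".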

Emergency Restart

# Force restart (may lose in-progress workflows)
kubectl delete pods -n agents -l app=workflow-engine --force
kubectl delete pods -n agents -l app=workflow-workers --force

# Wait for recovery
kubectl wait --for=condition=ready pod -l app=workflow-engine -n agents --timeout=120s
kubectl wait --for=condition=ready pod -l app=workflow-workers -n agents --timeout=120s

# Check for orphaned workflows
curl http://localhost:3003/api/v1/workflows/orphaned

# Recover or cancel orphaned workflows
curl -X POST http://localhost:3003/api/v1/workflows/recover-orphaned

Local Development Restart

# Docker Compose
docker compose restart workflow-engine workflow-workers

# Python/Celery (";" instead of "&&" so the worker still starts when nothing was running)
pkill -f "celery.*workflow"; celery -A workflow worker -l info

# Node.js/Bull
pkill -f "workflow-engine"; npm run start:workflow

Logs Location

Kubernetes Logs

# Engine logs
kubectl logs -f deployment/workflow-engine -n agents

# Worker logs
kubectl logs -f deployment/workflow-workers -n agents

# All workflow-related logs
kubectl logs -l app.kubernetes.io/component=workflow -n agents --all-containers

# Filter by workflow ID
kubectl logs deployment/workflow-engine -n agents | grep "workflow_id=abc123"

Local Logs

# Application logs
tail -f /var/log/workflow-engine/app.log

# Celery worker logs
tail -f /var/log/workflow-engine/worker.log

# Task execution logs
tail -f /var/log/workflow-engine/tasks.log

Workflow Tracing

# Get workflow execution trace
curl http://localhost:3003/api/v1/workflows/{workflow_id}/trace

# Export to JSON for analysis
curl http://localhost:3003/api/v1/workflows/{workflow_id}/trace > trace.json

Scaling

Worker Scaling

# Scale workers for more parallel tasks
kubectl scale deployment/workflow-workers --replicas=10 -n agents

# Enable HPA
kubectl autoscale deployment/workflow-workers -n agents \
  --min=2 --max=20 --cpu-percent=70

Engine Scaling

# Scale engine for more workflow coordination
kubectl scale deployment/workflow-engine --replicas=3 -n agents

Queue-Based Scaling

# Scale based on queue depth (KEDA ScaledObject; requires KEDA to be installed in the cluster)
kubectl apply -f - <<EOF
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: workflow-workers-scaler
  namespace: agents
spec:
  scaleTargetRef:
    name: workflow-workers
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
    - type: redis
      metadata:
        address: redis:6379
        listName: workflow:tasks:pending
        listLength: "10"
EOF
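
With listLength set to "10", the scaler targets about 10 pending tasks per worker replica. A rough estimate of where the replica count will settle is the queue length divided by the target, rounded up and clamped to the min/max above. The helper below is a sketch of that arithmetic only (desired_replicas is a hypothetical name); the real autoscaler also applies tolerances and stabilization windows, so actual counts can lag or differ.

```shell
# desired_replicas QUEUE_LENGTH [TARGET_PER_REPLICA] [MIN] [MAX]
# Defaults mirror the ScaledObject above: listLength 10, min 2, max 20.
desired_replicas() {
  queue=$1
  target=${2:-10}
  min=${3:-2}
  max=${4:-20}
  want=$(( (queue + target - 1) / target ))   # integer ceiling of queue/target
  if [ "$want" -lt "$min" ]; then want=$min; fi
  if [ "$want" -gt "$max" ]; then want=$max; fi
  echo "$want"
}
```

For example, desired_replicas 35 prints 4, and desired_replicas 1000 prints 20 because of the maxReplicaCount cap.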

Scaling Guidelines

Metric	Threshold	Action
Queue Depth	> 100 pending tasks	Scale workers
Worker CPU	> 70%	Scale workers
Task Completion Rate	< 10/min	Scale workers, investigate
Active Workflows	> 50/engine	Scale engine
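
The thresholds in the table can be expressed as a small decision helper for use in scripts or during an incident. This is a sketch: scaling_action is a hypothetical name, and it expects whole-number inputs that the operator has already read from the dashboards.

```shell
# scaling_action QUEUE_DEPTH WORKER_CPU_PCT TASKS_PER_MIN ACTIVE_WORKFLOWS_PER_ENGINE
# Prints the actions from the Scaling Guidelines table that apply, or "none".
scaling_action() {
  actions=""
  if [ "$1" -gt 100 ] || [ "$2" -gt 70 ] || [ "$3" -lt 10 ]; then
    actions="scale-workers"
  fi
  if [ "$3" -lt 10 ]; then actions="$actions investigate"; fi
  if [ "$4" -gt 50 ]; then actions="$actions scale-engine"; fi
  actions=${actions# }          # drop the leading space if the first rule did not fire
  echo "${actions:-none}"
}
```

For example, scaling_action 150 40 30 10 prints scale-workers, and scaling_action 20 40 30 60 prints scale-engine.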

Alerts

Critical Alerts (PagerDuty)

Alert	Condition	Runbook Action
EngineDown	0 healthy pods for 2min	Emergency Restart
AllWorkersDown	0 healthy workers for 2min	Restart workers
WorkflowStuckCritical	Workflow pending >1hr	Manual intervention
QueueBacklogCritical	>1000 pending tasks	Scale immediately

Warning Alerts (Slack)

Alert	Condition	Runbook Action
HighTaskLatency	Avg task duration >5min	Investigate, scale
WorkerRestarting	>3 restarts in 10min	Check memory, logs
QueueBacklog	>100 pending tasks	Scale workers
WorkflowFailureRate	>10% failure in 1hr	Investigate errors

Prometheus Alert Rules

groups:
  - name: workflow-engine
    rules:
      - alert: WorkflowEngineDown
        expr: up{job="workflow-engine"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Workflow Engine is down"
          runbook_url: "https://gitlab.com/blueflyio/agent-platform/technical-docs/-/wikis/runbooks/workflow-engine"

      - alert: WorkflowWorkersDown
        expr: up{job="workflow-workers"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Workflow Workers are down"

      - alert: HighQueueDepth
        expr: workflow_queue_depth > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Workflow queue depth high"

      - alert: WorkflowStuck
        expr: workflow_pending_duration_seconds > 3600
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Workflow stuck in pending state"

Monitoring Dashboards

Grafana: https://grafana.local/d/workflow-engine
Task Queue Dashboard: https://grafana.local/d/workflow-queue
Flower (Celery): http://localhost:5555 (if using Celery)
Bull Dashboard: http://localhost:3003/admin/queues (if using Bull)

Workflow Management Commands

Create Workflow

curl -X POST http://localhost:3003/api/v1/workflows \
  -H "Content-Type: application/json" \
  -d '{
    "name": "data-pipeline",
    "steps": [
      {"id": "fetch", "type": "http", "url": "..."},
      {"id": "process", "type": "agent", "agent": "processor", "depends_on": ["fetch"]},
      {"id": "store", "type": "database", "depends_on": ["process"]}
    ]
  }'

Cancel Workflow

curl -X POST http://localhost:3003/api/v1/workflows/{workflow_id}/cancel

Retry Failed Workflow

curl -X POST http://localhost:3003/api/v1/workflows/{workflow_id}/retry

Contacts

On-call: PagerDuty rotation
Slack: #platform-incidents
Owner: Platform Team

Related Runbooks

Redis Runbook - Task queue backend
PostgreSQL Runbook - Workflow storage
Agent Mesh Runbook - Agent coordination
Agent Brain Runbook - Task execution
