PostgreSQL Runbook
Overview
- Purpose: Primary relational database for persistent data storage. Stores agent configurations, workflow definitions, user data, session history, and application state.
- Port: 5432
- Health endpoint: TCP connection check or
SELECT 1
- Namespace:
data (Kubernetes)
- Version: PostgreSQL 15+
Dependencies
- Persistent Volume (PVC) - Data storage
- Backup Storage (S3/MinIO) - For backups
Database Layout
| Database | Purpose |
|---|
agents | Agent configurations and state |
workflows | Workflow definitions and history |
sessions | User sessions and chat history |
analytics | Usage metrics and logs |
Common Issues
Issue 1: Connection Pool Exhaustion
- Symptoms:
- "too many connections" errors
- Applications failing to connect
- Slow query execution
- Cause:
- Connection leaks in applications
- Too many concurrent connections
- Connection pool misconfigured
- Resolution:
# Check current connections
psql -h localhost -U postgres -c "SELECT count(*) FROM pg_stat_activity;"
# View connections by application
psql -h localhost -U postgres -c "SELECT application_name, count(*) FROM pg_stat_activity GROUP BY application_name;"
# Kill idle connections older than 10 minutes
psql -h localhost -U postgres -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '10 minutes';"
# Increase max connections (requires restart)
kubectl exec -it postgresql-0 -n data -- bash -c "echo 'max_connections = 300' >> /var/lib/postgresql/data/postgresql.conf"
kubectl rollout restart statefulset/postgresql -n data
Issue 2: High CPU / Slow Queries
- Symptoms:
- High CPU usage on PostgreSQL pod
- Queries taking >10 seconds
- Application timeouts
- Cause:
- Missing indexes
- Full table scans
- Lock contention
- Resolution:
# Find slow queries
psql -h localhost -U postgres -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE (now() - pg_stat_activity.query_start) > interval '5 seconds';"
# Check for missing indexes
psql -h localhost -U postgres -c "SELECT schemaname, relname, seq_scan, idx_scan FROM pg_stat_user_tables WHERE seq_scan > idx_scan ORDER BY seq_scan DESC LIMIT 10;"
# Analyze tables for query planner
psql -h localhost -U postgres -c "ANALYZE;"
# Kill long-running query
psql -h localhost -U postgres -c "SELECT pg_cancel_backend(pid);" # or pg_terminate_backend for force
# Create missing index (example)
psql -h localhost -U postgres -c "CREATE INDEX CONCURRENTLY idx_sessions_user_id ON sessions(user_id);"
Issue 3: Disk Space Full
- Symptoms:
- "No space left on device" errors
- INSERT/UPDATE failing
- WAL files accumulating
- Cause:
- Large tables growing unbounded
- WAL archiving failing
- Bloated tables (dead tuples)
- Resolution:
# Check disk usage
kubectl exec -it postgresql-0 -n data -- df -h /var/lib/postgresql/data
# Check table sizes
psql -h localhost -U postgres -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC LIMIT 10;"
# Clean up old WAL files
kubectl exec -it postgresql-0 -n data -- pg_archivecleanup /var/lib/postgresql/data/pg_wal 0000000100000001000000XX
# Vacuum tables to reclaim space
psql -h localhost -U postgres -c "VACUUM FULL ANALYZE;"
# Truncate old logs/sessions (if acceptable)
psql -h localhost -U postgres -c "DELETE FROM sessions WHERE created_at < now() - interval '90 days';"
Issue 4: Replication Lag
- Symptoms:
- Read replicas returning stale data
- "replication lag" metrics increasing
- Standby database behind primary
- Cause:
- Network issues between primary and replica
- Replica under-resourced
- Heavy write load on primary
- Resolution:
# Check replication status
psql -h localhost -U postgres -c "SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, (sent_lsn - replay_lsn) AS lag FROM pg_stat_replication;"
# Check replica lag in bytes
psql -h localhost -U postgres -c "SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes FROM pg_stat_replication;"
# On replica, check recovery status
psql -h replica -U postgres -c "SELECT pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp();"
# If severely lagging, may need to rebuild replica
# (see Disaster Recovery section)
Issue 5: Lock Contention
- Symptoms:
- Queries waiting indefinitely
- "deadlock detected" errors
- Transaction timeouts
- Cause:
- Long-running transactions holding locks
- Concurrent updates to same rows
- Missing proper transaction isolation
- Resolution:
# View current locks
psql -h localhost -U postgres -c "SELECT blocked_locks.pid AS blocked_pid, blocked_activity.usename AS blocked_user, blocking_locks.pid AS blocking_pid, blocking_activity.usename AS blocking_user, blocked_activity.query AS blocked_statement, blocking_activity.query AS blocking_statement FROM pg_catalog.pg_locks blocked_locks JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype AND blocking_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASE AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid AND blocking_locks.pid != blocked_locks.pid JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid WHERE NOT blocked_locks.GRANTED;"
# Terminate blocking transaction
psql -h localhost -U postgres -c "SELECT pg_terminate_backend(blocking_pid);"
# Set lock timeout for future queries
psql -h localhost -U postgres -c "ALTER SYSTEM SET lock_timeout = '30s';"
psql -h localhost -U postgres -c "SELECT pg_reload_conf();"
Restart Procedure
Graceful Restart (Recommended)
# 1. Check for active transactions
psql -h localhost -U postgres -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"
# 2. Enable maintenance mode (prevent new connections)
psql -h localhost -U postgres -c "ALTER DATABASE agents SET default_transaction_read_only = on;"
# 3. Wait for transactions to complete
sleep 30
# 4. Create checkpoint for clean shutdown
psql -h localhost -U postgres -c "CHECKPOINT;"
# 5. Restart PostgreSQL
kubectl rollout restart statefulset/postgresql -n data
# 6. Wait for ready
kubectl wait --for=condition=ready pod -l app=postgresql -n data --timeout=300s
# 7. Disable maintenance mode
psql -h localhost -U postgres -c "ALTER DATABASE agents SET default_transaction_read_only = off;"
# 8. Verify health
psql -h localhost -U postgres -c "SELECT 1;"
Emergency Restart
# Force restart (may cause data loss of uncommitted transactions)
kubectl delete pod postgresql-0 -n data --force
# Wait for pod to restart
kubectl wait --for=condition=ready pod postgresql-0 -n data --timeout=300s
# Check for recovery
psql -h localhost -U postgres -c "SELECT pg_is_in_recovery();"
# Verify data integrity
psql -h localhost -U postgres -c "SELECT count(*) FROM agents;"
Local Development Restart
# Docker
docker restart postgresql
# OrbStack
orb restart postgresql
# Homebrew (macOS)
brew services restart postgresql@15
Logs Location
Kubernetes Logs
# PostgreSQL server logs
kubectl logs -f statefulset/postgresql -n data
# Filter for errors
kubectl logs statefulset/postgresql -n data | grep -E "ERROR|FATAL|PANIC"
# Export logs
kubectl logs statefulset/postgresql -n data > pg-logs-$(date +%Y%m%d).txt
Log Files (Inside Container)
# Access log directory
kubectl exec -it postgresql-0 -n data -- ls -la /var/lib/postgresql/data/log/
# View latest log
kubectl exec -it postgresql-0 -n data -- tail -f /var/lib/postgresql/data/log/postgresql-*.log
Query Logging
# Enable query logging (temporary)
psql -h localhost -U postgres -c "ALTER SYSTEM SET log_statement = 'all';"
psql -h localhost -U postgres -c "SELECT pg_reload_conf();"
# Disable after debugging
psql -h localhost -U postgres -c "ALTER SYSTEM SET log_statement = 'none';"
psql -h localhost -U postgres -c "SELECT pg_reload_conf();"
Scaling
Vertical Scaling
# Increase resources
kubectl set resources statefulset/postgresql -n data \
--limits=cpu=4000m,memory=8Gi \
--requests=cpu=1000m,memory=2Gi
Connection Scaling
# Increase max connections
kubectl exec -it postgresql-0 -n data -- psql -U postgres -c "ALTER SYSTEM SET max_connections = 500;"
kubectl rollout restart statefulset/postgresql -n data
Read Replica Scaling
# Add read replica (using streaming replication)
# Requires additional configuration - see PostgreSQL documentation
# Route read queries to replica
# Configure pgpool or application-level routing
Storage Scaling
# Expand PVC (if supported by storage class)
kubectl patch pvc postgresql-data -n data -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'
Scaling Guidelines
| Metric | Threshold | Action |
|---|
| CPU Usage | > 70% | Increase CPU, optimize queries |
| Memory Usage | > 80% | Increase memory, tune shared_buffers |
| Connections | > 80% max | Increase max_connections |
| Disk Usage | > 80% | Expand storage, archive old data |
| Replication Lag | > 1MB | Scale replica, check network |
Backup & Recovery
Automated Backups
# Check backup status
kubectl get cronjob pg-backup -n data
# Manual backup trigger
kubectl create job --from=cronjob/pg-backup manual-backup-$(date +%Y%m%d) -n data
# List available backups
aws s3 ls s3://backups/postgresql/ --recursive
Point-in-Time Recovery
# 1. Stop applications
kubectl scale deployment --all --replicas=0 -n agents
# 2. Restore from backup
pg_restore -h localhost -U postgres -d agents --clean /path/to/backup.dump
# 3. Apply WAL to specific point
# (requires proper WAL archiving setup)
# 4. Restart applications
kubectl scale deployment --all --replicas=1 -n agents
Disaster Recovery
# Full restore from backup
kubectl exec -it postgresql-0 -n data -- pg_restore -U postgres -d postgres --clean /backup/latest.dump
# Rebuild replica from primary
# (see PostgreSQL streaming replication documentation)
Alerts
| Alert | Condition | Runbook Action |
|---|
| PostgreSQLDown | Cannot connect for 2min | Emergency Restart |
| DiskSpaceCritical | Disk > 95% full | Immediate cleanup, expand storage |
| ConnectionsExhausted | Connections > 95% max | Kill idle connections, increase limit |
Warning Alerts (Slack)
| Alert | Condition | Runbook Action |
|---|
| HighCPU | CPU > 80% for 10min | Optimize queries, scale |
| SlowQueries | Queries > 30s | Analyze, add indexes |
| ReplicationLag | Lag > 1MB | Check network, scale replica |
| DiskSpaceWarning | Disk > 80% | Plan expansion, archive |
Prometheus Alert Rules
groups:
- name: postgresql
rules:
- alert: PostgreSQLDown
expr: pg_up == 0
for: 2m
labels:
severity: critical
annotations:
summary: "PostgreSQL is down"
runbook_url: "https://gitlab.com/blueflyio/agent-platform/technical-docs/-/wikis/runbooks/postgresql"
- alert: PostgreSQLHighConnections
expr: pg_stat_activity_count / pg_settings_max_connections > 0.8
for: 5m
labels:
severity: warning
annotations:
summary: "PostgreSQL connection pool nearly exhausted"
- alert: PostgreSQLSlowQueries
expr: rate(pg_stat_activity_max_tx_duration[5m]) > 60
for: 5m
labels:
severity: warning
annotations:
summary: "Long-running PostgreSQL queries detected"
Monitoring Dashboards
- Grafana:
https://grafana.local/d/postgresql
- pgAdmin:
http://localhost:5050 (if deployed)
- Prometheus:
https://prometheus.local/graph?g0.expr=pg_up
- On-call: PagerDuty rotation
- Slack: #platform-incidents
- Owner: Platform Team / DBA Team
- Redis Runbook - Caching layer
- Agent Brain Runbook - Primary consumer
- Workflow Engine Runbook - Workflow storage