Skip to main content

postgresql

PostgreSQL Runbook

Overview

  • Purpose: Primary relational database for persistent data storage. Stores agent configurations, workflow definitions, user data, session history, and application state.
  • Port: 5432
  • Health endpoint: TCP connection check or SELECT 1
  • Namespace: data (Kubernetes)
  • Version: PostgreSQL 15+

Dependencies

  • Persistent Volume (PVC) - Data storage
  • Backup Storage (S3/MinIO) - For backups

Database Layout

DatabasePurpose
agentsAgent configurations and state
workflowsWorkflow definitions and history
sessionsUser sessions and chat history
analyticsUsage metrics and logs

Common Issues

Issue 1: Connection Pool Exhaustion

  • Symptoms:
    • "too many connections" errors
    • Applications failing to connect
    • Slow query execution
  • Cause:
    • Connection leaks in applications
    • Too many concurrent connections
    • Connection pool misconfigured
  • Resolution:
    # Check current connections psql -h localhost -U postgres -c "SELECT count(*) FROM pg_stat_activity;" # View connections by application psql -h localhost -U postgres -c "SELECT application_name, count(*) FROM pg_stat_activity GROUP BY application_name;" # Kill idle connections older than 10 minutes psql -h localhost -U postgres -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '10 minutes';" # Increase max connections (requires restart) kubectl exec -it postgresql-0 -n data -- bash -c "echo 'max_connections = 300' >> /var/lib/postgresql/data/postgresql.conf" kubectl rollout restart statefulset/postgresql -n data

Issue 2: High CPU / Slow Queries

  • Symptoms:
    • High CPU usage on PostgreSQL pod
    • Queries taking >10 seconds
    • Application timeouts
  • Cause:
    • Missing indexes
    • Full table scans
    • Lock contention
  • Resolution:
    # Find slow queries psql -h localhost -U postgres -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE (now() - pg_stat_activity.query_start) > interval '5 seconds';" # Check for missing indexes psql -h localhost -U postgres -c "SELECT schemaname, relname, seq_scan, idx_scan FROM pg_stat_user_tables WHERE seq_scan > idx_scan ORDER BY seq_scan DESC LIMIT 10;" # Analyze tables for query planner psql -h localhost -U postgres -c "ANALYZE;" # Kill long-running query psql -h localhost -U postgres -c "SELECT pg_cancel_backend(pid);" # or pg_terminate_backend for force # Create missing index (example) psql -h localhost -U postgres -c "CREATE INDEX CONCURRENTLY idx_sessions_user_id ON sessions(user_id);"

Issue 3: Disk Space Full

  • Symptoms:
    • "No space left on device" errors
    • INSERT/UPDATE failing
    • WAL files accumulating
  • Cause:
    • Large tables growing unbounded
    • WAL archiving failing
    • Bloated tables (dead tuples)
  • Resolution:
    # Check disk usage kubectl exec -it postgresql-0 -n data -- df -h /var/lib/postgresql/data # Check table sizes psql -h localhost -U postgres -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC LIMIT 10;" # Clean up old WAL files kubectl exec -it postgresql-0 -n data -- pg_archivecleanup /var/lib/postgresql/data/pg_wal 0000000100000001000000XX # Vacuum tables to reclaim space psql -h localhost -U postgres -c "VACUUM FULL ANALYZE;" # Truncate old logs/sessions (if acceptable) psql -h localhost -U postgres -c "DELETE FROM sessions WHERE created_at < now() - interval '90 days';"

Issue 4: Replication Lag

  • Symptoms:
    • Read replicas returning stale data
    • "replication lag" metrics increasing
    • Standby database behind primary
  • Cause:
    • Network issues between primary and replica
    • Replica under-resourced
    • Heavy write load on primary
  • Resolution:
    # Check replication status psql -h localhost -U postgres -c "SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, (sent_lsn - replay_lsn) AS lag FROM pg_stat_replication;" # Check replica lag in bytes psql -h localhost -U postgres -c "SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes FROM pg_stat_replication;" # On replica, check recovery status psql -h replica -U postgres -c "SELECT pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp();" # If severely lagging, may need to rebuild replica # (see Disaster Recovery section)

Issue 5: Lock Contention

  • Symptoms:
    • Queries waiting indefinitely
    • "deadlock detected" errors
    • Transaction timeouts
  • Cause:
    • Long-running transactions holding locks
    • Concurrent updates to same rows
    • Missing proper transaction isolation
  • Resolution:
    # View current locks psql -h localhost -U postgres -c "SELECT blocked_locks.pid AS blocked_pid, blocked_activity.usename AS blocked_user, blocking_locks.pid AS blocking_pid, blocking_activity.usename AS blocking_user, blocked_activity.query AS blocked_statement, blocking_activity.query AS blocking_statement FROM pg_catalog.pg_locks blocked_locks JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype AND blocking_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASE AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid AND blocking_locks.pid != blocked_locks.pid JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid WHERE NOT blocked_locks.GRANTED;" # Terminate blocking transaction psql -h localhost -U postgres -c "SELECT pg_terminate_backend(blocking_pid);" # Set lock timeout for future queries psql -h localhost -U postgres -c "ALTER SYSTEM SET lock_timeout = '30s';" psql -h localhost -U postgres -c "SELECT pg_reload_conf();"

Restart Procedure

# 1. Check for active transactions psql -h localhost -U postgres -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';" # 2. Enable maintenance mode (prevent new connections) psql -h localhost -U postgres -c "ALTER DATABASE agents SET default_transaction_read_only = on;" # 3. Wait for transactions to complete sleep 30 # 4. Create checkpoint for clean shutdown psql -h localhost -U postgres -c "CHECKPOINT;" # 5. Restart PostgreSQL kubectl rollout restart statefulset/postgresql -n data # 6. Wait for ready kubectl wait --for=condition=ready pod -l app=postgresql -n data --timeout=300s # 7. Disable maintenance mode psql -h localhost -U postgres -c "ALTER DATABASE agents SET default_transaction_read_only = off;" # 8. Verify health psql -h localhost -U postgres -c "SELECT 1;"

Emergency Restart

# Force restart (may cause data loss of uncommitted transactions) kubectl delete pod postgresql-0 -n data --force # Wait for pod to restart kubectl wait --for=condition=ready pod postgresql-0 -n data --timeout=300s # Check for recovery psql -h localhost -U postgres -c "SELECT pg_is_in_recovery();" # Verify data integrity psql -h localhost -U postgres -c "SELECT count(*) FROM agents;"

Local Development Restart

# Docker docker restart postgresql # OrbStack orb restart postgresql # Homebrew (macOS) brew services restart postgresql@15

Logs Location

Kubernetes Logs

# PostgreSQL server logs kubectl logs -f statefulset/postgresql -n data # Filter for errors kubectl logs statefulset/postgresql -n data | grep -E "ERROR|FATAL|PANIC" # Export logs kubectl logs statefulset/postgresql -n data > pg-logs-$(date +%Y%m%d).txt

Log Files (Inside Container)

# Access log directory kubectl exec -it postgresql-0 -n data -- ls -la /var/lib/postgresql/data/log/ # View latest log kubectl exec -it postgresql-0 -n data -- tail -f /var/lib/postgresql/data/log/postgresql-*.log

Query Logging

# Enable query logging (temporary) psql -h localhost -U postgres -c "ALTER SYSTEM SET log_statement = 'all';" psql -h localhost -U postgres -c "SELECT pg_reload_conf();" # Disable after debugging psql -h localhost -U postgres -c "ALTER SYSTEM SET log_statement = 'none';" psql -h localhost -U postgres -c "SELECT pg_reload_conf();"

Scaling

Vertical Scaling

# Increase resources kubectl set resources statefulset/postgresql -n data \ --limits=cpu=4000m,memory=8Gi \ --requests=cpu=1000m,memory=2Gi

Connection Scaling

# Increase max connections kubectl exec -it postgresql-0 -n data -- psql -U postgres -c "ALTER SYSTEM SET max_connections = 500;" kubectl rollout restart statefulset/postgresql -n data

Read Replica Scaling

# Add read replica (using streaming replication) # Requires additional configuration - see PostgreSQL documentation # Route read queries to replica # Configure pgpool or application-level routing

Storage Scaling

# Expand PVC (if supported by storage class) kubectl patch pvc postgresql-data -n data -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'

Scaling Guidelines

MetricThresholdAction
CPU Usage> 70%Increase CPU, optimize queries
Memory Usage> 80%Increase memory, tune shared_buffers
Connections> 80% maxIncrease max_connections
Disk Usage> 80%Expand storage, archive old data
Replication Lag> 1MBScale replica, check network

Backup & Recovery

Automated Backups

# Check backup status kubectl get cronjob pg-backup -n data # Manual backup trigger kubectl create job --from=cronjob/pg-backup manual-backup-$(date +%Y%m%d) -n data # List available backups aws s3 ls s3://backups/postgresql/ --recursive

Point-in-Time Recovery

# 1. Stop applications kubectl scale deployment --all --replicas=0 -n agents # 2. Restore from backup pg_restore -h localhost -U postgres -d agents --clean /path/to/backup.dump # 3. Apply WAL to specific point # (requires proper WAL archiving setup) # 4. Restart applications kubectl scale deployment --all --replicas=1 -n agents

Disaster Recovery

# Full restore from backup kubectl exec -it postgresql-0 -n data -- pg_restore -U postgres -d postgres --clean /backup/latest.dump # Rebuild replica from primary # (see PostgreSQL streaming replication documentation)

Alerts

Critical Alerts (PagerDuty)

AlertConditionRunbook Action
PostgreSQLDownCannot connect for 2minEmergency Restart
DiskSpaceCriticalDisk > 95% fullImmediate cleanup, expand storage
ConnectionsExhaustedConnections > 95% maxKill idle connections, increase limit

Warning Alerts (Slack)

AlertConditionRunbook Action
HighCPUCPU > 80% for 10minOptimize queries, scale
SlowQueriesQueries > 30sAnalyze, add indexes
ReplicationLagLag > 1MBCheck network, scale replica
DiskSpaceWarningDisk > 80%Plan expansion, archive

Prometheus Alert Rules

groups: - name: postgresql rules: - alert: PostgreSQLDown expr: pg_up == 0 for: 2m labels: severity: critical annotations: summary: "PostgreSQL is down" runbook_url: "https://gitlab.com/blueflyio/agent-platform/technical-docs/-/wikis/runbooks/postgresql" - alert: PostgreSQLHighConnections expr: pg_stat_activity_count / pg_settings_max_connections > 0.8 for: 5m labels: severity: warning annotations: summary: "PostgreSQL connection pool nearly exhausted" - alert: PostgreSQLSlowQueries expr: rate(pg_stat_activity_max_tx_duration[5m]) > 60 for: 5m labels: severity: warning annotations: summary: "Long-running PostgreSQL queries detected"

Monitoring Dashboards

  • Grafana: https://grafana.local/d/postgresql
  • pgAdmin: http://localhost:5050 (if deployed)
  • Prometheus: https://prometheus.local/graph?g0.expr=pg_up

Contacts

  • On-call: PagerDuty rotation
  • Slack: #platform-incidents
  • Owner: Platform Team / DBA Team