Redis Runbook

Overview

  • Purpose: In-memory data store for caching, session management, pub/sub messaging, task queues, and rate limiting across the agent platform.
  • Port: 6379
  • Health endpoint: PING command returns PONG
  • Namespace: data (Kubernetes)
  • Version: Redis 7+

Dependencies

  • Persistent Volume (PVC) - For RDB/AOF persistence

Key Namespaces

Prefix       Purpose
session:*    User sessions
cache:*      Application caches
queue:*      Task queues (Bull/Celery)
rate:*       Rate limiting counters
mesh:*       Agent mesh state
workflow:*   Workflow task queues
lock:*       Distributed locks
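
To see how keys are distributed across the prefixes above, key names can be grouped by the text before the first colon. The following is a minimal sketch, not part of the platform tooling: the function name `count_by_prefix` is hypothetical, and it assumes key names contain no newlines.

```shell
# Summarize key counts per prefix (the text before the first ':').
# Reads one key name per line on stdin, so it can be fed from `redis-cli --scan`.
count_by_prefix() {
  awk -F: '
    NF == 0 { next }   # skip empty lines
    { p = (NF > 1) ? ($1 ":*") : "(no prefix)"; n[p]++ }
    END { for (p in n) printf "%s %d\n", p, n[p] }
  ' | LC_ALL=C sort
}

# Example (assumes Redis is reachable on localhost:6379):
#   redis-cli -h localhost -p 6379 --scan --count 1000 | count_by_prefix
```

Because it consumes `--scan` output rather than `KEYS`, the walk does not block the server, although on a large keyspace it can still take a while.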

Common Issues

Issue 1: Memory Limit Reached

  • Symptoms:
    • OOM errors in logs
    • Write commands failing
    • "maxmemory reached" errors
  • Cause:
    • Cache not expiring
    • Memory leak (keys not being deleted)
    • Insufficient maxmemory setting
  • Resolution:
    # Check memory usage
    redis-cli -h localhost -p 6379 INFO memory
    # Find large keys
    redis-cli -h localhost -p 6379 --bigkeys
    # Count all keys, then keys matching a pattern (--scan avoids the blocking KEYS command)
    redis-cli -h localhost -p 6379 DBSIZE
    redis-cli -h localhost -p 6379 --scan --pattern "cache:*" | wc -l
    # Enable memory eviction (if not set); allkeys-lru can also evict session:*, queue:* and lock:* keys
    redis-cli -h localhost -p 6379 CONFIG SET maxmemory-policy allkeys-lru
    # Manually delete a large cache namespace
    redis-cli -h localhost -p 6379 --scan --pattern "cache:old:*" | xargs redis-cli -h localhost -p 6379 DEL
    # Increase maxmemory (takes effect immediately; also update the Redis config so it survives a restart)
    redis-cli -h localhost -p 6379 CONFIG SET maxmemory 2gb
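
The `INFO memory` output from the first step can be reduced to a single percentage for comparison against the 80% threshold used elsewhere in this runbook. A minimal sketch, assuming the standard `used_memory` and `maxmemory` fields; the function name `memory_usage_pct` is hypothetical.

```shell
# Print used_memory as a percentage of maxmemory, reading `INFO memory` text on stdin.
# Prints "unlimited" when maxmemory is 0 (no limit configured).
memory_usage_pct() {
  # redis-cli emits CRLF line endings for INFO, so strip the carriage returns first.
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max + 0 == 0) { print "unlimited"; exit }
      printf "%.1f\n", used * 100 / max
    }'
}

# Example (assumes Redis is reachable on localhost:6379):
#   redis-cli -h localhost -p 6379 INFO memory | memory_usage_pct
```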

Issue 2: Connection Refused

  • Symptoms:
    • Applications unable to connect
    • "Connection refused" errors
    • TCP connections timing out
  • Cause:
    • Redis process crashed
    • Max connections reached
    • Network/firewall issues
  • Resolution:
    # Check if Redis is running
    kubectl get pods -n data -l app=redis
    # Check Redis logs
    kubectl logs -f statefulset/redis -n data
    # Check current connections
    redis-cli -h localhost -p 6379 CLIENT LIST | wc -l
    # Check max connections
    redis-cli -h localhost -p 6379 CONFIG GET maxclients
    # Disconnect all normal clients, idle or not (they must reconnect); SKIPME yes spares this connection
    redis-cli -h localhost -p 6379 CLIENT KILL TYPE normal SKIPME yes
    # Restart Redis if unresponsive
    kubectl rollout restart statefulset/redis -n data
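
When the connection count is close to `maxclients`, it helps to know which source is holding the connections before killing any. The sketch below groups `CLIENT LIST` output by client IP; the function name `clients_by_ip` is hypothetical, and it relies only on the `addr=` field that `CLIENT LIST` prints for every client.

```shell
# Count connections per client IP, reading `CLIENT LIST` output on stdin.
# Output: "<count> <ip>", highest count first.
clients_by_ip() {
  tr -d '\r' | awk '
    {
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^addr=/) {          # matches addr= but not laddr=
          ip = substr($i, 6)          # drop the "addr=" prefix
          sub(/:[0-9]+$/, "", ip)     # drop the trailing :port
          n[ip]++
        }
      }
    }
    END { for (ip in n) printf "%d %s\n", n[ip], ip }
  ' | sort -rn
}

# Example (assumes Redis is reachable on localhost:6379):
#   redis-cli -h localhost -p 6379 CLIENT LIST | clients_by_ip | head
```

A single IP with a disproportionate count usually points to a connection leak in that application rather than a limit that needs raising.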

Issue 3: Slow Commands / High Latency

  • Symptoms:
    • Commands taking >10ms
    • Application timeouts
    • "slowlog" entries accumulating
  • Cause:
    • Large key operations (KEYS, SMEMBERS on large sets)
    • Blocking commands
    • Memory swapping
  • Resolution:
    # Check slowlog
    redis-cli -h localhost -p 6379 SLOWLOG GET 10
    # Monitor in real time (Ctrl+C to stop; use briefly, it impacts performance)
    redis-cli -h localhost -p 6379 MONITOR
    # Check latency
    redis-cli -h localhost -p 6379 --latency
    # Check for memory swapping (a mem_fragmentation_ratio below 1 suggests part of Redis memory is swapped out)
    redis-cli -h localhost -p 6379 INFO memory | grep -E "used_memory_rss|mem_fragmentation_ratio"
    # Identify problematic commands
    redis-cli -h localhost -p 6379 INFO commandstats | grep -E "cmdstat_(keys|smembers)"
    # Replace KEYS with SCAN in application code
    # Use SSCAN instead of SMEMBERS for large sets
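
The last two recommendations depend on cursor-based iteration: each `SCAN` call returns a new cursor plus a batch of keys, and the walk is complete when the cursor comes back as 0. The sketch below shows that loop in shell for illustration only (`redis-cli --scan` already does the same thing). The function name `scan_keys` and the `REDIS_CLI` override variable are hypothetical, and it assumes `redis-cli` prints the cursor on the first line of a non-interactive reply.

```shell
# Iterate a keyspace with SCAN instead of KEYS.
# Usage: scan_keys "cache:*"
scan_keys() {
  pattern=$1
  cursor=0
  while :; do
    reply=$(${REDIS_CLI:-redis-cli} -h localhost -p 6379 SCAN "$cursor" MATCH "$pattern" COUNT 100)
    # First line of the reply is the next cursor; the remaining lines are keys.
    cursor=$(printf '%s\n' "$reply" | sed -n '1p')
    [ -z "$cursor" ] && return 1   # no reply (for example, connection failure)
    printf '%s\n' "$reply" | sed '1d;/^$/d'
    [ "$cursor" = "0" ] && break   # cursor 0 means the iteration is complete
  done
}
```

Each call does a bounded amount of work, so other clients are served between batches. A batch may be empty even when the cursor is non-zero, which is why the loop terminates on the cursor and not on the batch size.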

Issue 4: Data Loss After Restart

  • Symptoms:
    • Keys missing after pod restart
    • Sessions expired unexpectedly
    • Cache empty after recovery
  • Cause:
    • Persistence not configured
    • RDB/AOF corrupted
    • PVC not mounted correctly
  • Resolution:
    # Check persistence configuration
    redis-cli -h localhost -p 6379 CONFIG GET save
    redis-cli -h localhost -p 6379 CONFIG GET appendonly
    # Check PVC status
    kubectl get pvc -n data -l app=redis
    # Verify data directory
    kubectl exec -it redis-0 -n data -- ls -la /data
    # Enable AOF persistence
    redis-cli -h localhost -p 6379 CONFIG SET appendonly yes
    # Force RDB save
    redis-cli -h localhost -p 6379 BGSAVE
    # Check the time of the last successful save (Unix timestamp) and its status
    redis-cli -h localhost -p 6379 LASTSAVE
    redis-cli -h localhost -p 6379 INFO persistence | grep rdb_last_bgsave_status
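
The persistence checks above can be condensed into one pass over `INFO persistence`. This is a rough sketch under the assumption that the `aof_enabled`, `rdb_last_bgsave_status`, and `aof_last_write_status` fields are present; the function name `persistence_summary` is hypothetical.

```shell
# Summarize persistence health, reading `INFO persistence` text on stdin.
# Exits non-zero when the last RDB save failed, or when AOF is enabled and its last write failed.
persistence_summary() {
  tr -d '\r' | awk -F: '
    $1 == "aof_enabled"            { aof = $2 }
    $1 == "rdb_last_bgsave_status" { rdb = $2 }
    $1 == "aof_last_write_status"  { aofw = $2 }
    END {
      printf "aof_enabled=%s rdb_last_bgsave_status=%s aof_last_write_status=%s\n", aof, rdb, aofw
      if (rdb != "ok" || (aof == 1 && aofw != "ok")) exit 1
    }'
}

# Example (assumes Redis is reachable on localhost:6379):
#   redis-cli -h localhost -p 6379 INFO persistence | persistence_summary || echo "persistence unhealthy"
```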

Issue 5: Pub/Sub Not Working

  • Symptoms:
    • Subscribers not receiving messages
    • Agent mesh communication broken
    • "No subscribers for channel" in logs
  • Cause:
    • Subscribers disconnected
    • Wrong channel name
    • Network partitioning
  • Resolution:
    # Check active subscriptions
    redis-cli -h localhost -p 6379 PUBSUB NUMSUB channel1 channel2
    # List all channels with subscribers
    redis-cli -h localhost -p 6379 PUBSUB CHANNELS "*"
    # Test pub/sub manually
    # Terminal 1:
    redis-cli SUBSCRIBE test-channel
    # Terminal 2:
    redis-cli PUBLISH test-channel "hello"
    # Check subscriber connections (grep sub would match every client line)
    redis-cli -h localhost -p 6379 CLIENT LIST TYPE pubsub
    # Restart subscriber applications
    kubectl rollout restart deployment/agent-mesh -n agents
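
`PUBSUB NUMSUB` replies with alternating channel names and subscriber counts, so channels that have lost all subscribers can be picked out mechanically. A small sketch, assuming the non-interactive `redis-cli` output puts each value on its own line; the function name `channels_without_subscribers` and the channel names in the example are hypothetical.

```shell
# Print channels with zero subscribers, reading `PUBSUB NUMSUB ...` output on stdin.
# Input alternates: channel name on odd lines, subscriber count on even lines.
channels_without_subscribers() {
  tr -d '\r' | awk '
    NR % 2 == 1 { ch = $0; next }   # odd line: remember the channel name
    $0 + 0 == 0 { print ch }        # even line: count is zero
  '
}

# Example (channel names are placeholders):
#   redis-cli -h localhost -p 6379 PUBSUB NUMSUB mesh:events mesh:control | channels_without_subscribers
```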

Issue 6: Cluster Split (Sentinel/Cluster Mode)

  • Symptoms:
    • Write failures
    • Inconsistent reads
    • Failover not completing
  • Cause:
    • Network partition between nodes
    • Sentinel quorum lost
    • Master election failing
  • Resolution:
    # Check Sentinel status (if using Sentinel)
    redis-cli -p 26379 SENTINEL master mymaster
    redis-cli -p 26379 SENTINEL replicas mymaster
    # Check cluster status (if using Redis Cluster)
    redis-cli -h localhost -p 6379 CLUSTER INFO
    redis-cli -h localhost -p 6379 CLUSTER NODES
    # Force failover
    redis-cli -p 26379 SENTINEL FAILOVER mymaster
    # Reset cluster node (only works on a replica or on a master holding no keys)
    redis-cli -h localhost -p 6379 CLUSTER RESET SOFT

Restart Procedure

# 1. Check for critical operations (counts connected clients; each CLIENT LIST line has a cmd= field)
redis-cli -h localhost -p 6379 CLIENT LIST | grep -c "cmd="
# 2. Record the last save time, then force an RDB save
before=$(redis-cli -h localhost -p 6379 LASTSAVE)
redis-cli -h localhost -p 6379 BGSAVE
# 3. Wait for save to complete (LASTSAVE changes once the new save succeeds)
while [ "$(redis-cli -h localhost -p 6379 LASTSAVE)" -eq "$before" ]; do sleep 1; done
# 4. Perform rolling restart
kubectl rollout restart statefulset/redis -n data
# 5. Wait for ready
kubectl wait --for=condition=ready pod redis-0 -n data --timeout=120s
# 6. Verify health
redis-cli -h localhost -p 6379 PING
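
Step 3 can also be written with a timeout so that a failed save does not block the restart indefinitely. The sketch below polls the `rdb_bgsave_in_progress` field of `INFO persistence`, a different signal from comparing `LASTSAVE` timestamps. The function name `wait_for_bgsave` and the `REDIS_CLI` override variable are hypothetical.

```shell
# Wait until no background save is running, or give up after a timeout.
# Usage: wait_for_bgsave [timeout_seconds]   (default: 60)
wait_for_bgsave() {
  timeout=${1:-60}
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    in_progress=$(${REDIS_CLI:-redis-cli} -h localhost -p 6379 INFO persistence \
      | tr -d '\r' | awk -F: '$1 == "rdb_bgsave_in_progress" { print $2 }')
    [ "$in_progress" = "0" ] && return 0   # save finished (or none was running)
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 1   # timed out
}

# Example:
#   redis-cli -h localhost -p 6379 BGSAVE && wait_for_bgsave 120 || echo "BGSAVE did not finish"
```

A return code of 0 only says the save is no longer running; `rdb_last_bgsave_status` still has to be checked to confirm that it succeeded.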

Emergency Restart

# Force restart
kubectl delete pod redis-0 -n data --force
# Wait for recovery
kubectl wait --for=condition=ready pod redis-0 -n data --timeout=120s
# Sanity-check the key count after recovery (DBSIZE does not verify data integrity)
redis-cli -h localhost -p 6379 DBSIZE

Local Development Restart

# Docker
docker restart redis
# OrbStack
orb restart redis
# Homebrew (macOS)
brew services restart redis

Logs Location

Kubernetes Logs

# Redis logs
kubectl logs -f statefulset/redis -n data
# Filter for warnings/errors
kubectl logs statefulset/redis -n data | grep -E "WARNING|ERROR"
# Export logs
kubectl logs statefulset/redis -n data > redis-logs-$(date +%Y%m%d).txt

Inside Container

# Redis log file (if configured)
kubectl exec -it redis-0 -n data -- cat /var/log/redis/redis.log

Command Monitoring

# Real-time command monitoring (use briefly, impacts performance)
redis-cli -h localhost -p 6379 MONITOR | head -100
# Slow query log
redis-cli -h localhost -p 6379 SLOWLOG GET 25

Scaling

Vertical Scaling

# Increase memory limit
kubectl set resources statefulset/redis -n data \
  --limits=cpu=2000m,memory=4Gi \
  --requests=cpu=500m,memory=1Gi
# Increase Redis maxmemory
redis-cli -h localhost -p 6379 CONFIG SET maxmemory 3gb

Horizontal Scaling (Read Replicas)

# Add read replica
kubectl scale statefulset/redis-replica -n data --replicas=3
# Configure clients for read-replica routing
# (Application-level configuration required)

Cluster Mode

# For true horizontal scaling, migrate to Redis Cluster
# Requires significant architectural changes
# See Redis Cluster documentation

Scaling Guidelines

Metric         Threshold          Action
Memory Usage   > 80%              Increase maxmemory, add eviction
CPU Usage      > 70%              Scale vertically, add replicas
Connections    > 80% maxclients   Increase limit, add replicas
Commands/sec   > 100K             Add read replicas
Latency P99    > 5ms              Check slowlog, optimize keys
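
The guidelines above can be encoded as a small lookup for use in ad hoc scripts. This is an illustrative sketch only: the function name `scaling_action` and the metric keys (`memory_pct`, `cpu_pct`, `connections_pct`, `commands_per_sec`, `latency_p99_ms`) are hypothetical labels for the table rows, not names used by the platform.

```shell
# Print the recommended action when a metric exceeds its guideline threshold, else "ok".
# Usage: scaling_action <metric> <value>
scaling_action() {
  metric=$1
  value=$2
  case $metric in
    memory_pct)       limit=80;     action="Increase maxmemory, add eviction" ;;
    cpu_pct)          limit=70;     action="Scale vertically, add replicas" ;;
    connections_pct)  limit=80;     action="Increase limit, add replicas" ;;
    commands_per_sec) limit=100000; action="Add read replicas" ;;
    latency_p99_ms)   limit=5;      action="Check slowlog, optimize keys" ;;
    *) echo "unknown metric: $metric" >&2; return 2 ;;
  esac
  # awk handles fractional values such as a 4.2 ms latency.
  if awk -v v="$value" -v l="$limit" 'BEGIN { exit !(v + 0 > l + 0) }'; then
    printf '%s\n' "$action"
  else
    printf 'ok\n'
  fi
}

# Example:
#   scaling_action memory_pct 85
```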

Alerts

Critical Alerts (PagerDuty)

Alert                Condition                  Runbook Action
RedisDown            Cannot connect for 1 min   Emergency Restart
MemoryFull           Memory usage 100%          Flush cache, increase memory
PersistenceFailure   RDB/AOF save failing       Check disk, fix persistence

Warning Alerts (Slack)

Alert             Condition                  Runbook Action
HighMemory        Memory > 80%               Review keys, increase limit
HighConnections   Connections > 80%          Check for leaks, increase limit
SlowCommands      Slowlog entries > 10/min   Optimize commands
ReplicationLag    Replica lag > 1s           Check network

Prometheus Alert Rules

groups:
  - name: redis
    rules:
      - alert: RedisDown
        expr: redis_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Redis is down"
          runbook_url: "https://gitlab.com/blueflyio/agent-platform/technical-docs/-/wikis/runbooks/redis"
      - alert: RedisMemoryHigh
        expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory usage high"
      - alert: RedisConnectionsHigh
        expr: redis_connected_clients / redis_config_maxclients > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis connection pool nearly exhausted"

Monitoring Dashboards

  • Grafana: https://grafana.local/d/redis
  • Redis Insight: http://localhost:8001 (if deployed)
  • Prometheus: https://prometheus.local/graph?g0.expr=redis_up

Useful Redis Commands

# Memory analysis
redis-cli -h localhost -p 6379 MEMORY DOCTOR
redis-cli -h localhost -p 6379 MEMORY STATS
# Key analysis
redis-cli -h localhost -p 6379 --bigkeys
redis-cli -h localhost -p 6379 --memkeys
# Server info
redis-cli -h localhost -p 6379 INFO
redis-cli -h localhost -p 6379 INFO replication
redis-cli -h localhost -p 6379 INFO persistence
# Flush specific database (use with caution); -n selects database 1 for this invocation
redis-cli -h localhost -p 6379 -n 1 FLUSHDB
# Scan for keys matching pattern
redis-cli -h localhost -p 6379 SCAN 0 MATCH "cache:*" COUNT 100

Contacts

  • On-call: PagerDuty rotation
  • Slack: #platform-incidents
  • Owner: Platform Team