Redis Runbook
Overview
- Purpose: In-memory data store for caching, session management, pub/sub messaging, task queues, and rate limiting across the agent platform.
- Port: 6379
- Health endpoint: PING command returns PONG
- Namespace: data (Kubernetes)
- Version: Redis 7+
Dependencies
- PersistentVolumeClaim (PVC) - Backs the data directory for RDB/AOF persistence
Key Namespaces
| Prefix | Purpose |
|---|---|
| session:* | User sessions |
| cache:* | Application caches |
| queue:* | Task queues (Bull/Celery) |
| rate:* | Rate limiting counters |
| mesh:* | Agent mesh state |
| workflow:* | Workflow task queues |
| lock:* | Distributed locks |
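To see how keys are distributed across the prefixes above, a small helper can tally key names by the text before the first colon. This is a minimal sketch: `count_by_prefix` is a hypothetical name, and it only processes key names fed to it on stdin.

```shell
# Count keys per top-level prefix (the text before the first ':').
# Reads one key name per line on stdin, prints "<prefix> <count>" sorted by prefix.
count_by_prefix() {
  awk -F: 'NF { counts[$1]++ } END { for (p in counts) print p, counts[p] }' | sort
}

# Example (assumes redis-cli can reach the instance):
#   redis-cli -h localhost -p 6379 --scan | count_by_prefix
```

Feeding it from `--scan` keeps the server responsive, because `--scan` iterates with SCAN rather than KEYS.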
Common Issues
Issue 1: Memory Limit Reached
- Symptoms:
- OOM errors in logs
- Write commands failing
- "maxmemory reached" errors
- Cause:
- Cache not expiring
- Memory leak (keys not being deleted)
- Insufficient maxmemory setting
- Resolution:
# Check memory usage
redis-cli -h localhost -p 6379 INFO memory
# Find large keys
redis-cli -h localhost -p 6379 --bigkeys
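# Count keys overall and by pattern (--scan iterates with SCAN and does not block the server the way KEYS does)
redis-cli -h localhost -p 6379 DBSIZE
redis-cli -h localhost -p 6379 --scan --pattern "cache:*" | wc -l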
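# Enable memory eviction (if not set)
# Note: allkeys-lru may evict any key, including session:*, queue:*, and lock:*; volatile-lru evicts only keys that have a TTL
redis-cli -h localhost -p 6379 CONFIG SET maxmemory-policy allkeys-lru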
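# Manually delete a large cache namespace (SCAN-based, in batches; UNLINK frees memory in the background)
redis-cli -h localhost -p 6379 --scan --pattern "cache:old:*" | xargs -r -n 100 redis-cli -h localhost -p 6379 UNLINK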
# Increase maxmemory (takes effect immediately; lost on restart unless also set in redis.conf or saved with CONFIG REWRITE)
redis-cli -h localhost -p 6379 CONFIG SET maxmemory 2gb
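To turn the `INFO memory` output into a single number, the sketch below computes used memory as a percentage of maxmemory. `memory_usage_pct` is a hypothetical helper name. It relies only on the `used_memory` and `maxmemory` fields of `INFO memory`, and strips the carriage returns that Redis puts at the end of each line.

```shell
# Print used_memory as an integer percentage of maxmemory.
# Reads `INFO memory` output on stdin; prints "unlimited" when maxmemory is 0.
memory_usage_pct() {
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { limit = $2 }
    END {
      if (limit + 0 == 0) print "unlimited"
      else printf "%d\n", used * 100 / limit
    }'
}

# Example (assumes redis-cli can reach the instance):
#   redis-cli -h localhost -p 6379 INFO memory | memory_usage_pct
```

A result above 80 matches the HighMemory warning threshold used elsewhere in this runbook.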
Issue 2: Connection Refused
- Symptoms:
- Applications unable to connect
- "Connection refused" errors
- TCP connections timing out
- Cause:
- Redis process crashed
- Max connections reached
- Network/firewall issues
- Resolution:
# Check if Redis is running
kubectl get pods -n data -l app=redis
# Check Redis logs
kubectl logs -f statefulset/redis -n data
# Check current connections
redis-cli -h localhost -p 6379 CLIENT LIST | wc -l
# Check max connections
redis-cli -h localhost -p 6379 CONFIG GET maxclients
# Kill all normal client connections except this one (not only idle ones; applications must reconnect)
redis-cli -h localhost -p 6379 CLIENT KILL TYPE normal SKIPME yes
# Restart Redis if unresponsive
kubectl rollout restart statefulset/redis -n data
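Before killing connections, it can help to know how many are actually idle. The following sketch counts clients by the `idle=<seconds>` field that appears on every `CLIENT LIST` line. `count_idle_clients` is a hypothetical helper, and the 300-second threshold in the example is an arbitrary assumption.

```shell
# Count clients whose idle time (seconds) is at least the given threshold.
# Reads `CLIENT LIST` output on stdin.
count_idle_clients() {
  threshold=$1
  awk -v t="$threshold" '
    {
      for (i = 1; i <= NF; i++)
        if ($i ~ /^idle=/) {
          split($i, kv, "=")
          if (kv[2] + 0 >= t + 0) n++
        }
    }
    END { print n + 0 }'
}

# Example (assumes redis-cli can reach the instance):
#   redis-cli -h localhost -p 6379 CLIENT LIST | count_idle_clients 300
```

A high idle count relative to `maxclients` usually points to a connection leak in an application pool rather than a Redis-side problem.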
Issue 3: Slow Commands / High Latency
- Symptoms:
- Commands taking >10ms
- Application timeouts
- "slowlog" entries accumulating
- Cause:
- O(N) commands on large datasets (KEYS across the keyspace, SMEMBERS on large sets)
- Blocking commands
- Memory swapping
- Resolution:
# Check slowlog
redis-cli -h localhost -p 6379 SLOWLOG GET 10
# Monitor in real-time
redis-cli -h localhost -p 6379 MONITOR # (Ctrl+C to stop, use briefly)
# Check latency
redis-cli -h localhost -p 6379 --latency
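# Check for memory swapping (a mem_fragmentation_ratio below 1 suggests part of the dataset is swapped out)
redis-cli -h localhost -p 6379 INFO memory | grep -E "used_memory_rss|mem_fragmentation_ratio"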
# Identify problematic commands
redis-cli -h localhost -p 6379 INFO commandstats | grep "cmdstat_keys\|cmdstat_smembers"
# Replace KEYS with SCAN in application code
# Use SSCAN instead of SMEMBERS for large sets
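The SCAN advice above applies to ad hoc shell work as well. As a rough sketch, the loop below follows the SCAN cursor until it returns to 0, which is what marks a complete iteration. `scan_keys` and the `REDIS_CMD` variable are hypothetical names, and the COUNT of 100 is an arbitrary batch hint.

```shell
# Command used to reach Redis; override it to target another host or port.
REDIS_CMD=${REDIS_CMD:-"redis-cli -h localhost -p 6379"}

# Print every key matching a pattern by following the SCAN cursor until it is 0.
scan_keys() {
  pattern=$1
  cursor=0
  while :; do
    # Non-interactive redis-cli prints the next cursor first, then one key per line.
    reply=$($REDIS_CMD SCAN "$cursor" MATCH "$pattern" COUNT 100) || return 1
    cursor=$(printf '%s\n' "$reply" | head -n 1)
    printf '%s\n' "$reply" | tail -n +2 | sed '/^$/d'
    case "$cursor" in
      0|"") break ;;   # 0 means the iteration is complete; empty means no reply
    esac
  done
}

# Example: scan_keys "cache:*" | wc -l
```

SCAN may return the same key more than once during a full iteration, so pipe the output through `sort -u` when exact counts matter.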
Issue 4: Data Loss After Restart
- Symptoms:
- Keys missing after pod restart
- Sessions expired unexpectedly
- Cache empty after recovery
- Cause:
- Persistence not configured
- RDB/AOF corrupted
- PVC not mounted correctly
- Resolution:
# Check persistence configuration
redis-cli -h localhost -p 6379 CONFIG GET save
redis-cli -h localhost -p 6379 CONFIG GET appendonly
# Check PVC status
kubectl get pvc -n data -l app=redis
# Verify data directory
kubectl exec -it redis-0 -n data -- ls -la /data
# Enable AOF persistence
redis-cli -h localhost -p 6379 CONFIG SET appendonly yes
# Force RDB save
redis-cli -h localhost -p 6379 BGSAVE
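# Check last successful save time (Unix timestamp) and whether the last background save succeeded
redis-cli -h localhost -p 6379 LASTSAVE
redis-cli -h localhost -p 6379 INFO persistence | grep rdb_last_bgsave_status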
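The persistence checks above can be condensed into one pass over `INFO persistence`. The sketch below reads the `rdb_last_bgsave_status`, `aof_enabled`, and `aof_last_write_status` fields. `persistence_ok` is a hypothetical helper. It reports failed saves only; it cannot tell whether persistence is configured at all, because Redis reports `ok` when no save has ever failed.

```shell
# Summarize persistence health from `INFO persistence` output on stdin.
# Exit status 0: last RDB save succeeded and, if AOF is enabled, its last write succeeded.
persistence_ok() {
  tr -d '\r' | awk -F: '
    $1 == "rdb_last_bgsave_status" { rdb = $2 }
    $1 == "aof_enabled"            { aof = $2 }
    $1 == "aof_last_write_status"  { aofw = $2 }
    END {
      printf "rdb_last_bgsave_status=%s aof_enabled=%s\n", rdb, aof
      exit !(rdb == "ok" && (aof != "1" || aofw == "ok"))
    }'
}

# Example (assumes redis-cli can reach the instance):
#   redis-cli -h localhost -p 6379 INFO persistence | persistence_ok || echo "persistence failing"
```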
Issue 5: Pub/Sub Not Working
- Symptoms:
- Subscribers not receiving messages
- Agent mesh communication broken
- "No subscribers for channel" in logs
- Cause:
- Subscribers disconnected
- Wrong channel name
- Network partitioning
- Resolution:
# Check active subscriptions
redis-cli -h localhost -p 6379 PUBSUB NUMSUB channel1 channel2
# List all channels with subscribers
redis-cli -h localhost -p 6379 PUBSUB CHANNELS "*"
# Test pub/sub manually
# Terminal 1: redis-cli SUBSCRIBE test-channel
# Terminal 2: redis-cli PUBLISH test-channel "hello"
# List clients that hold subscriptions (grep sub would match every client, since each line has a sub= field)
redis-cli -h localhost -p 6379 CLIENT LIST TYPE pubsub
# Restart subscriber applications
kubectl rollout restart deployment/agent-mesh -n agents
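When several channels need checking at once, the `PUBSUB NUMSUB` reply can be filtered for channels with zero subscribers. This is a sketch under the assumption that redis-cli runs non-interactively, in which case it prints the channel name and subscriber count on alternating lines. `channels_without_subscribers` is a hypothetical helper, and the channel names in the example are placeholders.

```shell
# List channels that currently have no subscribers.
# Reads non-interactive `PUBSUB NUMSUB` output on stdin:
# channel name and subscriber count on alternating lines.
channels_without_subscribers() {
  tr -d '\r' | awk '
    NR % 2 == 1 { channel = $0; next }
    $0 + 0 == 0 { print channel }'
}

# Example (channel1 and channel2 are placeholders):
#   redis-cli -h localhost -p 6379 PUBSUB NUMSUB channel1 channel2 | channels_without_subscribers
```

`PUBSUB NUMSUB` counts only exact-channel subscriptions; clients subscribed through patterns (PSUBSCRIBE) do not appear in these counts.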
Issue 6: Cluster Split (Sentinel/Cluster Mode)
- Symptoms:
- Write failures
- Inconsistent reads
- Failover not completing
- Cause:
- Network partition between nodes
- Sentinel quorum lost
- Master election failing
- Resolution:
# Check Sentinel status (if using Sentinel)
redis-cli -p 26379 SENTINEL master mymaster
redis-cli -p 26379 SENTINEL replicas mymaster
# Check cluster status (if using Redis Cluster)
redis-cli -h localhost -p 6379 CLUSTER INFO
redis-cli -h localhost -p 6379 CLUSTER NODES
# Force failover
redis-cli -p 26379 SENTINEL FAILOVER mymaster
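# Reset cluster node (last resort: the node forgets all other nodes and its slot assignments; fails on a master that still holds keys)
redis-cli -h localhost -p 6379 CLUSTER RESET SOFT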
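Before forcing a failover or resetting a node, it is worth confirming which nodes the cluster itself considers failed. The sketch below filters `CLUSTER NODES` output on its third column, the comma-separated flags, for `fail` (confirmed) or `fail?` (suspected). `failed_cluster_nodes` is a hypothetical helper name.

```shell
# Print id, address, and flags of nodes flagged fail or fail? (suspected failure).
# Reads `CLUSTER NODES` output on stdin; column 3 holds comma-separated flags.
failed_cluster_nodes() {
  awk '{
    n = split($3, flags, ",")
    for (i = 1; i <= n; i++)
      if (flags[i] == "fail" || flags[i] == "fail?") {
        print $1, $2, $3
        break
      }
  }'
}

# Example (assumes redis-cli can reach a cluster node):
#   redis-cli -h localhost -p 6379 CLUSTER NODES | failed_cluster_nodes
```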
Restart Procedure
Graceful Restart (Recommended)
# 1. Count connected clients (every CLIENT LIST line carries a cmd= field; inspect the full output for long-running commands)
redis-cli -h localhost -p 6379 CLIENT LIST | grep -c "cmd="
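# 2. Record the last save time, then force an RDB save
before=$(redis-cli -h localhost -p 6379 LASTSAVE)
redis-cli -h localhost -p 6379 BGSAVE
# 3. Wait for save to complete (LASTSAVE changes once the new snapshot is written)
while [ "$(redis-cli -h localhost -p 6379 LASTSAVE)" -eq "$before" ]; do sleep 1; done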
# 4. Perform rolling restart
kubectl rollout restart statefulset/redis -n data
# 5. Wait for ready
kubectl wait --for=condition=ready pod redis-0 -n data --timeout=120s
# 6. Verify health
redis-cli -h localhost -p 6379 PING
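A single PING right after the pod reports ready can still fail while Redis loads its dataset, so the health check is often wrapped in a short retry loop. The following is a minimal sketch: `wait_for_pong` and `REDIS_CMD` are hypothetical names, and the default of 30 attempts with a 1-second delay is an assumption rather than a tuned value.

```shell
# Command used to reach Redis; override it to target another host or port.
REDIS_CMD=${REDIS_CMD:-"redis-cli -h localhost -p 6379"}

# Retry PING until it answers PONG.
# Arguments: max attempts (default 30), delay between attempts in seconds (default 1).
# Returns 0 on PONG, 1 if every attempt failed.
wait_for_pong() {
  attempts=${1:-30}
  delay=${2:-1}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if [ "$($REDIS_CMD PING 2>/dev/null)" = "PONG" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example: wait_for_pong 60 2 || echo "Redis did not become healthy"
```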
Emergency Restart
# Force restart
kubectl delete pod redis-0 -n data --force
# Wait for recovery
kubectl wait --for=condition=ready pod redis-0 -n data --timeout=120s
# Check data integrity
redis-cli -h localhost -p 6379 DBSIZE
Local Development Restart
# Docker
docker restart redis
# OrbStack
orb restart redis
# Homebrew (macOS)
brew services restart redis
Logs Location
Kubernetes Logs
# Redis logs
kubectl logs -f statefulset/redis -n data
# Filter for warnings/errors
kubectl logs statefulset/redis -n data | grep -E "WARNING|ERROR"
# Export logs
kubectl logs statefulset/redis -n data > redis-logs-$(date +%Y%m%d).txt
Inside Container
# Redis log file (if configured)
kubectl exec -it redis-0 -n data -- cat /var/log/redis/redis.log
Command Monitoring
# Real-time command monitoring (use briefly, impacts performance)
redis-cli -h localhost -p 6379 MONITOR | head -100
# Slow query log
redis-cli -h localhost -p 6379 SLOWLOG GET 25
Scaling
Vertical Scaling
# Increase memory limit
kubectl set resources statefulset/redis -n data \
--limits=cpu=2000m,memory=4Gi \
--requests=cpu=500m,memory=1Gi
# Increase Redis maxmemory
redis-cli -h localhost -p 6379 CONFIG SET maxmemory 3gb
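Redis maxmemory should stay below the container memory limit so that fragmentation, replication buffers, and fork overhead during saves do not trigger an OOM kill. As a sketch, the helper below derives a maxmemory value from the limit. `maxmemory_for_limit_mib` is a hypothetical name, and the 75% default is an assumption inferred from the 3gb / 4Gi pairing above, not a Redis rule.

```shell
# Suggest a maxmemory value in MiB that leaves headroom below the container limit.
# Arguments: container memory limit in MiB, percentage to hand to Redis (default 75).
maxmemory_for_limit_mib() {
  limit_mib=$1
  ratio_pct=${2:-75}
  echo $(( limit_mib * ratio_pct / 100 ))
}

# Example for a 4Gi (4096 MiB) limit:
#   redis-cli -h localhost -p 6379 CONFIG SET maxmemory "$(maxmemory_for_limit_mib 4096)mb"
```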
Horizontal Scaling (Read Replicas)
# Add read replica
kubectl scale statefulset/redis-replica -n data --replicas=3
# Configure clients for read-replica routing
# (Application-level configuration required)
Cluster Mode
# For true horizontal scaling, migrate to Redis Cluster
# Requires significant architectural changes
# See Redis Cluster documentation
Scaling Guidelines
| Metric | Threshold | Action |
|---|---|---|
| Memory Usage | > 80% | Increase maxmemory, add eviction |
| CPU Usage | > 70% | Scale vertically, add replicas |
| Connections | > 80% maxclients | Increase limit, add replicas |
| Commands/sec | > 100K | Add read replicas |
| Latency P99 | > 5ms | Check slowlog, optimize keys |
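The percentage thresholds in the table can be encoded as a quick check for use in scripts or during triage. This is only a sketch: `scaling_actions` is a hypothetical helper, it covers the three percentage-based rows, and the caller is assumed to supply integer values gathered elsewhere.

```shell
# Print the guideline action for each metric that exceeds its threshold.
# Arguments: memory usage %, CPU usage %, connections as % of maxclients (integers).
scaling_actions() {
  mem=$1
  cpu=$2
  conn=$3
  if [ "$mem" -gt 80 ]; then echo "memory: increase maxmemory, add eviction"; fi
  if [ "$cpu" -gt 70 ]; then echo "cpu: scale vertically, add replicas"; fi
  if [ "$conn" -gt 80 ]; then echo "connections: increase limit, add replicas"; fi
  return 0
}

# Example: scaling_actions 85 50 90
```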
Alerts
Critical Alerts (PagerDuty)
| Alert | Condition | Runbook Action |
|---|---|---|
| RedisDown | Cannot connect for 1min | Emergency Restart |
| MemoryFull | Memory usage 100% | Flush cache, increase memory |
| PersistenceFailure | RDB/AOF save failing | Check disk, fix persistence |
Warning Alerts (Slack)
| Alert | Condition | Runbook Action |
|---|---|---|
| HighMemory | Memory > 80% | Review keys, increase limit |
| HighConnections | Connections > 80% | Check for leaks, increase limit |
| SlowCommands | Slowlog entries > 10/min | Optimize commands |
| ReplicationLag | Replica lag > 1s | Check network |
Prometheus Alert Rules
groups:
  - name: redis
    rules:
      - alert: RedisDown
        expr: redis_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Redis is down"
          runbook_url: "https://gitlab.com/blueflyio/agent-platform/technical-docs/-/wikis/runbooks/redis"
      - alert: RedisMemoryHigh
        expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory usage high"
      - alert: RedisConnectionsHigh
        expr: redis_connected_clients / redis_config_maxclients > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis connection pool nearly exhausted"
Monitoring Dashboards
- Grafana: https://grafana.local/d/redis
- Redis Insight: http://localhost:8001 (if deployed)
- Prometheus: https://prometheus.local/graph?g0.expr=redis_up
Useful Redis Commands
# Memory analysis
redis-cli -h localhost -p 6379 MEMORY DOCTOR
redis-cli -h localhost -p 6379 MEMORY STATS
# Key analysis
redis-cli -h localhost -p 6379 --bigkeys
redis-cli -h localhost -p 6379 --memkeys
# Server info
redis-cli -h localhost -p 6379 INFO
redis-cli -h localhost -p 6379 INFO replication
redis-cli -h localhost -p 6379 INFO persistence
# Flush a specific database (use with caution; -n selects the database index)
redis-cli -h localhost -p 6379 -n 1 FLUSHDB
# Scan for keys matching pattern
redis-cli -h localhost -p 6379 SCAN 0 MATCH "cache:*" COUNT 100
Contacts
- On-call: PagerDuty rotation
- Slack: #platform-incidents
- Owner: Platform Team
Related Runbooks
- PostgreSQL Runbook - Primary data store
- Agent Mesh Runbook - Pub/sub consumer
- Workflow Engine Runbook - Task queue consumer
- Agent Router Runbook - Rate limiting