Redis Runbook

Overview

Purpose: In-memory data store for caching, session management, pub/sub messaging, task queues, and rate limiting across the agent platform.
Port: 6379
Health check: the PING command returns PONG
Namespace: data (Kubernetes)
Version: Redis 7+

Dependencies

Persistent Volume (PVC) - For RDB/AOF persistence

Key Namespaces

Prefix	Purpose
`session:*`	User sessions
`cache:*`	Application caches
`queue:*`	Task queues (Bull/Celery)
`rate:*`	Rate limiting counters
`mesh:*`	Agent mesh state
`workflow:*`	Workflow task queues
`lock:*`	Distributed locks

Common Issues

Issue 1: Memory Limit Reached

Symptoms:
- OOM errors in logs
- Write commands failing
- "maxmemory reached" errors
Cause:
- Cache not expiring
- Memory leak (keys not being deleted)
- Insufficient maxmemory setting

Resolution:

# Check memory usage
redis-cli -h localhost -p 6379 INFO memory

# Find large keys
redis-cli -h localhost -p 6379 --bigkeys

# Count total keys, then keys matching a pattern (SCAN-based; KEYS blocks the server)
redis-cli -h localhost -p 6379 DBSIZE
redis-cli -h localhost -p 6379 --scan --pattern "cache:*" | wc -l

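# Enable memory eviction (if not set). allkeys-lru can evict any key, including session:*, queue:*, and lock:*;
# use volatile-lru to evict only keys that have a TTL
redis-cli -h localhost -p 6379 CONFIG SET maxmemory-policy allkeys-lru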

# Manually delete a stale cache namespace (SCAN + UNLINK avoids blocking the server)
redis-cli -h localhost -p 6379 --scan --pattern "cache:old:*" | xargs -n 100 redis-cli -h localhost -p 6379 UNLINK

# Increase maxmemory (takes effect immediately; run CONFIG REWRITE or update redis.conf so it survives a restart)
redis-cli -h localhost -p 6379 CONFIG SET maxmemory 2gb
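
To compare usage against the 80% threshold used elsewhere in this runbook, the `INFO memory` output can be reduced to a single percentage. The helper below is a minimal sketch: `redis_mem_pct` is a hypothetical name (not part of redis-cli), and it assumes the standard `used_memory` and `maxmemory` fields are present in the output.

```shell
# Print Redis memory usage as a whole-number percentage of maxmemory.
# Reads `INFO memory` output on stdin; prints "unlimited" when maxmemory is 0.
# (redis_mem_pct is an illustrative helper, not a redis-cli feature.)
redis_mem_pct() {
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max + 0 == 0) { print "unlimited"; exit }
      printf "%d\n", (used * 100) / max
    }'
}

# Usage:
#   redis-cli -h localhost -p 6379 INFO memory | redis_mem_pct
```

The `tr -d '\r'` step is there because redis-cli output from `INFO` uses CRLF line endings, which would otherwise end up inside the parsed values.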

Issue 2: Connection Refused

Symptoms:
- Applications unable to connect
- "Connection refused" errors
- TCP connections timing out
Cause:
- Redis process crashed
- Max connections reached
- Network/firewall issues

Resolution:

# Check if Redis is running
kubectl get pods -n data -l app=redis

# Check Redis logs
kubectl logs -f statefulset/redis -n data

# Check current connections
redis-cli -h localhost -p 6379 CLIENT LIST | wc -l

# Check max connections
redis-cli -h localhost -p 6379 CONFIG GET maxclients

# Disconnect all normal clients (not only idle ones; applications must reconnect)
redis-cli -h localhost -p 6379 CLIENT KILL TYPE normal SKIPME yes

# Restart Redis if unresponsive
kubectl rollout restart statefulset/redis -n data
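
When only idle connections should be closed, one option is to filter `CLIENT LIST` by its `idle=` field and kill the matching client IDs individually. The following is a sketch under that assumption; `idle_clients` is a hypothetical helper name and the 300-second default is an arbitrary illustration, not a platform setting.

```shell
# List clients idle longer than a threshold (seconds, default 300).
# Reads `CLIENT LIST` output on stdin; prints "<id> <addr> <idle>" per match.
# (idle_clients is an illustrative helper, not a redis-cli feature.)
idle_clients() {
  threshold="${1:-300}"
  tr -d '\r' | awk -v limit="$threshold" '
    {
      id = ""; addr = ""; idle = -1
      for (i = 1; i <= NF; i++) {
        split($i, kv, "=")
        if (kv[1] == "id")   id = kv[2]
        if (kv[1] == "addr") addr = kv[2]
        if (kv[1] == "idle") idle = kv[2] + 0
      }
      if (id != "" && idle > limit) print id, addr, idle
    }'
}

# Usage: review the list first, then kill by ID.
#   redis-cli -h localhost -p 6379 CLIENT LIST | idle_clients 600
#   redis-cli -h localhost -p 6379 CLIENT KILL ID <id>
```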

Issue 3: Slow Commands / High Latency

Symptoms:
- Commands taking >10ms
- Application timeouts
- "slowlog" entries accumulating
Cause:
- Large key operations (KEYS, SMEMBERS on large sets)
- Blocking commands
- Memory swapping

Resolution:

# Check slowlog
redis-cli -h localhost -p 6379 SLOWLOG GET 10

# Monitor in real-time
redis-cli -h localhost -p 6379 MONITOR # (Ctrl+C to stop, use briefly)

# Check latency
redis-cli -h localhost -p 6379 --latency

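# Check for memory swapping (INFO does not report swap directly; a mem_fragmentation_ratio below 1 suggests swapping)
redis-cli -h localhost -p 6379 INFO memory | grep -E "used_memory_rss|mem_fragmentation_ratio"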

# Identify problematic commands
redis-cli -h localhost -p 6379 INFO commandstats | grep "cmdstat_keys\|cmdstat_smembers"

# Replace KEYS with SCAN in application code
# Use SSCAN instead of SMEMBERS for large sets
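
Beyond KEYS and SMEMBERS, any command with a high average cost can be a latency source. The sketch below ranks commands by the `usec_per_call` field of `INFO commandstats`; `slowest_commands` is a hypothetical helper name, and the default of five rows is arbitrary.

```shell
# Rank commands by average microseconds per call, highest first.
# Reads `INFO commandstats` output on stdin; prints "<usec_per_call> <command>".
# (slowest_commands is an illustrative helper, not a redis-cli feature.)
slowest_commands() {
  top="${1:-5}"
  tr -d '\r' | awk -F'[:,=]' '
    /^cmdstat_/ {
      name = substr($1, 9)            # strip the "cmdstat_" prefix
      for (i = 2; i < NF; i += 2)     # fields alternate key, value
        if ($i == "usec_per_call") print $(i + 1), name
    }' | sort -rn | head -n "$top"
}

# Usage:
#   redis-cli -h localhost -p 6379 INFO commandstats | slowest_commands 10
```

A high average is only a hint: a command called rarely but slowly may matter less than a moderately slow one called constantly, so it is worth reading the result alongside the `calls` counts.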

Issue 4: Data Loss After Restart

Symptoms:
- Keys missing after pod restart
- Sessions expired unexpectedly
- Cache empty after recovery
Cause:
- Persistence not configured
- RDB/AOF corrupted
- PVC not mounted correctly

Resolution:

# Check persistence configuration
redis-cli -h localhost -p 6379 CONFIG GET save
redis-cli -h localhost -p 6379 CONFIG GET appendonly

# Check PVC status
kubectl get pvc -n data -l app=redis

# Verify data directory
kubectl exec -it redis-0 -n data -- ls -la /data

# Enable AOF persistence (runtime only; run CONFIG REWRITE or update redis.conf so it survives a restart)
redis-cli -h localhost -p 6379 CONFIG SET appendonly yes

# Force RDB save
redis-cli -h localhost -p 6379 BGSAVE

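# Check last save time (Unix timestamp) and status
redis-cli -h localhost -p 6379 LASTSAVE
redis-cli -h localhost -p 6379 INFO persistence | grep rdb_last_bgsave_status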
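
The persistence checks above can be folded into one pass over `INFO persistence` that fails loudly when a save or AOF write has errored. This is a sketch only: `persistence_ok` is a hypothetical helper name, and it looks at just the `rdb_last_bgsave_status`, `aof_last_write_status`, and `aof_enabled` fields.

```shell
# Summarize persistence health from `INFO persistence` output on stdin.
# Prints "OK (...)" and returns 0, or prints "FAIL: ..." and returns 1.
# (persistence_ok is an illustrative helper, not a redis-cli feature.)
persistence_ok() {
  tr -d '\r' | awk -F: '
    $1 == "rdb_last_bgsave_status" && $2 != "ok" { bad = bad " " $1 "=" $2 }
    $1 == "aof_last_write_status"  && $2 != "ok" { bad = bad " " $1 "=" $2 }
    $1 == "aof_enabled"                          { aof = $2 }
    END {
      if (bad != "") { print "FAIL:" bad; exit 1 }
      print "OK (aof_enabled=" aof ")"
    }'
}

# Usage:
#   redis-cli -h localhost -p 6379 INFO persistence | persistence_ok
```

Note that `aof_enabled=0` is reported but not treated as a failure here, since RDB-only persistence may be intentional.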

Issue 5: Pub/Sub Not Working

Symptoms:
- Subscribers not receiving messages
- Agent mesh communication broken
- "No subscribers for channel" in logs
Cause:
- Subscribers disconnected
- Wrong channel name
- Network partitioning

Resolution:

# Check active subscriptions
redis-cli -h localhost -p 6379 PUBSUB NUMSUB channel1 channel2

# List all channels with subscribers
redis-cli -h localhost -p 6379 PUBSUB CHANNELS "*"

# Test pub/sub manually
# Terminal 1: redis-cli SUBSCRIBE test-channel
# Terminal 2: redis-cli PUBLISH test-channel "hello"

# Check client connections with active subscriptions (sub= or psub= greater than 0)
redis-cli -h localhost -p 6379 CLIENT LIST | grep -E " p?sub=[1-9]"

# Restart subscriber applications
kubectl rollout restart deployment/agent-mesh -n agents

Issue 6: Cluster Split (Sentinel/Cluster Mode)

Symptoms:
- Write failures
- Inconsistent reads
- Failover not completing
Cause:
- Network partition between nodes
- Sentinel quorum lost
- Master election failing

Resolution:

# Check Sentinel status (if using Sentinel)
redis-cli -p 26379 SENTINEL master mymaster
redis-cli -p 26379 SENTINEL replicas mymaster

# Check cluster status (if using Redis Cluster)
redis-cli -h localhost -p 6379 CLUSTER INFO
redis-cli -h localhost -p 6379 CLUSTER NODES

# Force failover
redis-cli -p 26379 SENTINEL FAILOVER mymaster

# Reset cluster node (last resort: the node forgets all other nodes; fails on a master that still holds keys)
redis-cli -h localhost -p 6379 CLUSTER RESET SOFT
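
Before forcing a failover or resetting a node, it helps to confirm what the cluster itself reports. The sketch below condenses `CLUSTER INFO` into a pass/fail result; `cluster_healthy` is a hypothetical helper name, and it applies only to Redis Cluster deployments, not Sentinel.

```shell
# Report cluster health from `CLUSTER INFO` output on stdin.
# Returns 0 only when cluster_state is "ok" and no slots are in FAIL state.
# (cluster_healthy is an illustrative helper, not a redis-cli feature.)
cluster_healthy() {
  tr -d '\r' | awk -F: '
    BEGIN { failed = 0 }
    $1 == "cluster_state"      { state = $2 }
    $1 == "cluster_slots_fail" { failed = $2 + 0 }
    END {
      print "state=" state " slots_fail=" failed
      if (state == "ok" && failed == 0) exit 0
      exit 1
    }'
}

# Usage:
#   redis-cli -h localhost -p 6379 CLUSTER INFO | cluster_healthy
```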

Restart Procedure

Graceful Restart (Recommended)

# 1. Check for in-flight work: count connected clients (inspect the cmd= field of CLIENT LIST for long-running commands)
redis-cli -h localhost -p 6379 CLIENT LIST | grep -c "cmd="

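# 2. Record the last save time, then force an RDB save
last_save=$(redis-cli -h localhost -p 6379 LASTSAVE)
redis-cli -h localhost -p 6379 BGSAVE

# 3. Wait for save to complete (LASTSAVE changes once the new snapshot is written)
while [ "$(redis-cli -h localhost -p 6379 LASTSAVE)" -eq "$last_save" ]; do sleep 1; done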

# 4. Perform rolling restart
kubectl rollout restart statefulset/redis -n data

# 5. Wait for ready
kubectl wait --for=condition=ready pod redis-0 -n data --timeout=120s

# 6. Verify health
redis-cli -h localhost -p 6379 PING
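
A pod can report ready while Redis is still loading its dataset, during which commands return a LOADING error rather than PONG. A bounded polling loop for the final health check might look like the sketch below. `wait_for_pong` and the `REDIS_CLI` variable are hypothetical names introduced for illustration, and the 60-second default is arbitrary.

```shell
# Command used to reach Redis; override REDIS_CLI to target another host.
# (REDIS_CLI and wait_for_pong are illustrative names, not redis-cli features.)
REDIS_CLI="${REDIS_CLI:-redis-cli -h localhost -p 6379}"

# Poll PING once per second until Redis answers PONG or the timeout expires.
wait_for_pong() {
  remaining="${1:-60}"
  while [ "$remaining" -gt 0 ]; do
    # $REDIS_CLI is intentionally unquoted so its arguments are split.
    if [ "$($REDIS_CLI PING 2>/dev/null | tr -d '\r')" = "PONG" ]; then
      echo "ready"
      return 0
    fi
    sleep 1
    remaining=$((remaining - 1))
  done
  echo "timed out waiting for PONG" >&2
  return 1
}

# Usage:
#   wait_for_pong 120 && echo "Redis is serving traffic"
```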

Emergency Restart

# Force restart
kubectl delete pod redis-0 -n data --force

# Wait for recovery
kubectl wait --for=condition=ready pod redis-0 -n data --timeout=120s

# Check data integrity
redis-cli -h localhost -p 6379 DBSIZE

Local Development Restart

# Docker
docker restart redis

# OrbStack
orb restart redis

# Homebrew (macOS)
brew services restart redis

Logs Location

Kubernetes Logs

# Redis logs
kubectl logs -f statefulset/redis -n data

# Filter for warnings/errors
kubectl logs statefulset/redis -n data | grep -E "WARNING|ERROR"

# Export logs
kubectl logs statefulset/redis -n data > redis-logs-$(date +%Y%m%d).txt

Inside Container

# Redis log file (if configured)
kubectl exec -it redis-0 -n data -- cat /var/log/redis/redis.log

Command Monitoring

# Real-time command monitoring (use briefly, impacts performance)
redis-cli -h localhost -p 6379 MONITOR | head -100

# Slow query log
redis-cli -h localhost -p 6379 SLOWLOG GET 25

Scaling

Vertical Scaling

# Increase memory limit
kubectl set resources statefulset/redis -n data \
  --limits=cpu=2000m,memory=4Gi \
  --requests=cpu=500m,memory=1Gi

# Increase Redis maxmemory
redis-cli -h localhost -p 6379 CONFIG SET maxmemory 3gb

Horizontal Scaling (Read Replicas)

# Add read replica
kubectl scale statefulset/redis-replica -n data --replicas=3

# Configure clients for read-replica routing
# (Application-level configuration required)

Cluster Mode

# For true horizontal scaling, migrate to Redis Cluster
# Requires significant architectural changes
# See Redis Cluster documentation

Scaling Guidelines

Metric	Threshold	Action
Memory Usage	> 80%	Increase maxmemory, add eviction
CPU Usage	> 70%	Scale vertically, add replicas
Connections	> 80% maxclients	Increase limit, add replicas
Commands/sec	> 100K	Add read replicas
Latency P99	> 5ms	Check slowlog, optimize keys

Alerts

Critical Alerts (PagerDuty)

Alert	Condition	Runbook Action
RedisDown	Cannot connect for 1min	Emergency Restart
MemoryFull	Memory usage 100%	Flush cache, increase memory
PersistenceFailure	RDB/AOF save failing	Check disk, fix persistence

Warning Alerts (Slack)

Alert	Condition	Runbook Action
HighMemory	Memory > 80%	Review keys, increase limit
HighConnections	Connections > 80%	Check for leaks, increase limit
SlowCommands	Slowlog entries > 10/min	Optimize commands
ReplicationLag	Replica lag > 1s	Check network

Prometheus Alert Rules

groups:
  - name: redis
    rules:
      - alert: RedisDown
        expr: redis_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Redis is down"
          runbook_url: "https://gitlab.com/blueflyio/agent-platform/technical-docs/-/wikis/runbooks/redis"

      - alert: RedisMemoryHigh
        expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.8 and redis_memory_max_bytes > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory usage high"

      - alert: RedisConnectionsHigh
        expr: redis_connected_clients / redis_config_maxclients > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis connection pool nearly exhausted"

Monitoring Dashboards

Grafana: https://grafana.local/d/redis
Redis Insight: http://localhost:8001 (if deployed)
Prometheus: https://prometheus.local/graph?g0.expr=redis_up

Useful Redis Commands

# Memory analysis
redis-cli -h localhost -p 6379 MEMORY DOCTOR
redis-cli -h localhost -p 6379 MEMORY STATS

# Key analysis
redis-cli -h localhost -p 6379 --bigkeys
redis-cli -h localhost -p 6379 --memkeys

# Server info
redis-cli -h localhost -p 6379 INFO
redis-cli -h localhost -p 6379 INFO replication
redis-cli -h localhost -p 6379 INFO persistence

# Flush specific database (use with caution; -n selects the database, here 1)
redis-cli -h localhost -p 6379 -n 1 FLUSHDB

# Scan for keys matching pattern
redis-cli -h localhost -p 6379 SCAN 0 MATCH "cache:*" COUNT 100

Contacts

On-call: PagerDuty rotation
Slack: #platform-incidents
Owner: Platform Team

Related Runbooks

PostgreSQL Runbook - Primary data store
Agent Mesh Runbook - Pub/sub consumer
Workflow Engine Runbook - Task queue consumer
Agent Router Runbook - Rate limiting
