Qdrant Runbook

Overview

Purpose: Vector database for semantic search, similarity matching, and AI memory retrieval. Stores embeddings for agent memories, document chunks, and knowledge base entries.
Ports: 6333 (HTTP REST API), 6334 (gRPC)
Health endpoints: GET /healthz (liveness) or GET /readyz (readiness)
Namespace: data (Kubernetes)
Version: Qdrant 1.7+

Dependencies

Persistent Volume (PVC) - Vector storage
Agent Brain (port 3001) - Primary consumer

Collection Layout

Collection	Purpose	Dimensions
`agent_memory`	Agent long-term memory	1536 (OpenAI) or 1024 (local)
`documents`	Document embeddings for RAG	1536
`knowledge_base`	Structured knowledge entries	1536
`conversation_history`	Chat history embeddings	1536

Common Issues

Issue 1: Collection Not Found

Symptoms:
- 404 errors on vector operations
- "Collection does not exist" in logs
- Agent memory retrieval failing
Cause:
- Collection not created
- Wrong collection name in configuration
- Collection deleted accidentally

Resolution:

# List existing collections
curl http://localhost:6333/collections

# Create missing collection
curl -X PUT http://localhost:6333/collections/agent_memory \
  -H "Content-Type: application/json" \
  -d '{
    "vectors": {
      "size": 1536,
      "distance": "Cosine"
    },
    "optimizers_config": {
      "indexing_threshold": 20000
    }
  }'

# Verify collection exists
curl http://localhost:6333/collections/agent_memory
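
The list, create, and verify steps above can be folded into one idempotent helper. This is a minimal sketch rather than the platform's own tooling: the function names `collection_http_code` and `ensure_collection` and the `QDRANT_URL` variable are assumptions introduced here, and the helper relies on Qdrant answering 404 for a missing collection, as described in the symptoms.

```shell
# Assumption: QDRANT_URL points at the REST port (for example via port-forward).
QDRANT_URL="${QDRANT_URL:-http://localhost:6333}"

# Print the HTTP status code returned by GET /collections/<name>.
collection_http_code() {
  curl -s -o /dev/null -w '%{http_code}' "$QDRANT_URL/collections/$1"
}

# Create the collection only when Qdrant reports 404 for it.
# $1 = collection name, $2 = vector size
ensure_collection() {
  ec_code=$(collection_http_code "$1")
  case "$ec_code" in
    200) echo "exists" ;;
    404)
      # -f makes curl exit non-zero if the create call is rejected
      curl -sf -X PUT "$QDRANT_URL/collections/$1" \
        -H "Content-Type: application/json" \
        -d "{\"vectors\": {\"size\": $2, \"distance\": \"Cosine\"}}" >/dev/null \
        && echo "created" ;;
    *) echo "unexpected HTTP $ec_code" >&2; return 1 ;;
  esac
}
```

Calling `ensure_collection agent_memory 1536` is safe to repeat, because an existing collection is left untouched.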

Issue 2: Slow Vector Search

Symptoms:
- Search queries taking >1 second
- High CPU usage during search
- Memory retrieval timeouts
Cause:
- Index not optimized
- Too many vectors in collection
- HNSW index not built

Resolution:

# Check collection status
curl http://localhost:6333/collections/agent_memory

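# Trigger the optimizer. Qdrant has no dedicated "force optimize" endpoint;
# re-applying the optimizer settings makes it re-evaluate segments and build the HNSW index
curl -X PATCH http://localhost:6333/collections/agent_memory \
  -H "Content-Type: application/json" \
  -d '{
    "optimizers_config": {
      "indexing_threshold": 20000
    }
  }'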

# Check if indexing is in progress
curl http://localhost:6333/collections/agent_memory | jq '.result.status'

# Reduce search scope with filters
# (Application-level optimization)

# Scale up resources if collection is large
kubectl set resources deployment/qdrant -n data \
  --limits=cpu=4000m,memory=8Gi
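
The status check above can be wrapped in a polling loop so that an operator does not have to re-run it by hand while indexing finishes. The sketch below is an assumption-laden helper, not part of Qdrant: `collection_status`, `wait_for_green`, and `QDRANT_URL` are names introduced here, and `jq` must be installed.

```shell
QDRANT_URL="${QDRANT_URL:-http://localhost:6333}"

# Print the collection status ("green", "yellow", "red"). Requires jq.
collection_status() {
  curl -s "$QDRANT_URL/collections/$1" | jq -r '.result.status'
}

# Poll until the status is green or the attempts run out.
# $1 = collection, $2 = max attempts (default 30), $3 = seconds between polls (default 10)
wait_for_green() {
  wfg_attempts="${2:-30}"
  wfg_delay="${3:-10}"
  wfg_i=0
  wfg_status="unknown"
  while [ "$wfg_i" -lt "$wfg_attempts" ]; do
    wfg_status=$(collection_status "$1")
    if [ "$wfg_status" = "green" ]; then
      echo "green after $wfg_i retries"
      return 0
    fi
    wfg_i=$((wfg_i + 1))
    sleep "$wfg_delay"
  done
  echo "still $wfg_status after $wfg_attempts attempts" >&2
  return 1
}
```

With the defaults, `wait_for_green agent_memory` polls for up to about five minutes and returns a non-zero exit code if the collection never reaches green.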

Issue 3: Out of Memory

Symptoms:
- OOMKilled events
- Qdrant crashing during indexing
- "Not enough memory" errors
Cause:
- Large collection exceeding RAM
- HNSW index too large for memory
- Memory mapped files not working

Resolution:

# Check memory usage
kubectl top pods -n data -l app=qdrant

# Check collection size
curl http://localhost:6333/collections/agent_memory | jq '.result.points_count'

# Enable on-disk storage for large collections
curl -X PATCH http://localhost:6333/collections/agent_memory \
  -H "Content-Type: application/json" \
  -d '{
    "optimizers_config": {
      "memmap_threshold": 10000
    }
  }'

# Reduce HNSW memory usage
curl -X PATCH http://localhost:6333/collections/agent_memory \
  -H "Content-Type: application/json" \
  -d '{
    "hnsw_config": {
      "on_disk": true
    }
  }'

# Increase memory limits
kubectl set resources deployment/qdrant -n data \
  --limits=memory=16Gi
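
Before raising limits, it helps to estimate how much RAM the vectors need. The sketch below applies the rule of thumb from Qdrant's capacity guidance (points × dimension × 4 bytes × 1.5). It covers vectors and index overhead only, not payloads, and the function names are assumptions introduced here.

```shell
# bytes ~= points * dimension * 4 (float32) * 1.5 (index and metadata overhead)
# $1 = points count, $2 = vector dimension
estimate_ram_bytes() {
  echo $(( $1 * $2 * 4 * 3 / 2 ))
}

# Same estimate rounded up to whole GiB.
estimate_ram_gib() {
  erg_bytes=$(estimate_ram_bytes "$1" "$2")
  echo $(( (erg_bytes + 1073741823) / 1073741824 ))
}
```

For example, `estimate_ram_gib 1000000 1536` prints `9`, so one million 1536-dimension vectors kept fully in memory need roughly 9 GiB. Feed it the `points_count` value from the collection info call to size the memory limit or to decide when to enable on-disk storage.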

Issue 4: Write Failures

Symptoms:
- Upsert operations failing
- "Write ahead log full" errors
- Vector insertions timing out
Cause:
- Disk full
- WAL corruption
- Collection locked for optimization

Resolution:

# Check disk usage
kubectl exec -it qdrant-0 -n data -- df -h /qdrant/storage

# Check collection status
curl http://localhost:6333/collections/agent_memory | jq '.result.status'

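# If status is "yellow", optimization is in progress: wait for it to return to "green".
# If the disk is full, list the snapshots and remove old ones
kubectl exec -it qdrant-0 -n data -- ls -la /qdrant/storage/snapshots
# Run rm through a shell in the container so the glob expands there, not on your workstation
kubectl exec qdrant-0 -n data -- sh -c 'rm -rf /qdrant/storage/snapshots/old_*'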

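# Force WAL flush
# Qdrant flushes the WAL on its own (flush_interval_sec in optimizers_config). The endpoint
# below is not part of the documented REST API; confirm it exists in your deployment first
curl -X POST http://localhost:6333/collections/agent_memory/points/flush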

# Restart if WAL corrupted
kubectl rollout restart deployment/qdrant -n data
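
The disk check above can be reduced to a single severity value. This is a minimal sketch: `disk_use_pct` and `disk_severity` are hypothetical helper names, the thresholds mirror the 80% and 95% levels used in this runbook's scaling and alert tables, and `df -P` is used so that each filesystem stays on one line.

```shell
# Read `df -P <path>` output on stdin and print the use percentage as an integer.
disk_use_pct() {
  awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# Map a use percentage to a severity.
# $1 = integer percentage
disk_severity() {
  if [ "$1" -gt 95 ]; then
    echo critical
  elif [ "$1" -gt 80 ]; then
    echo warning
  else
    echo ok
  fi
}
```

Typical use is `disk_severity "$(kubectl exec qdrant-0 -n data -- df -P /qdrant/storage | disk_use_pct)"`, which prints `ok`, `warning`, or `critical`.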

Issue 5: Embedding Dimension Mismatch

Symptoms:
- "Vector size mismatch" errors
- 400 Bad Request on upsert
- Inconsistent search results
Cause:
- Embedding model changed
- Wrong collection used
- Misconfigured client

Resolution:

# Check expected dimension
curl http://localhost:6333/collections/agent_memory | jq '.result.config.params.vectors.size'

# Verify embedding from application
# (Check agent-brain configuration)

# If model changed, need to recreate collection and reindex
# 1. Backup existing data
curl -X POST "http://localhost:6333/collections/agent_memory/snapshots"

# 2. Delete and recreate with new dimension
curl -X DELETE http://localhost:6333/collections/agent_memory
curl -X PUT http://localhost:6333/collections/agent_memory \
  -H "Content-Type: application/json" \
  -d '{
    "vectors": {
      "size": 1024,
      "distance": "Cosine"
    }
  }'

# 3. Trigger reindexing from source data
curl -X POST http://localhost:3001/api/v1/memory/reindex
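
To confirm a mismatch before deleting anything, compare the length of one embedding produced by the application with the size the collection expects. The helpers below are a sketch under stated assumptions: `vector_len` and `check_dimension` are names introduced here, and the embedding must be a flat JSON array of numbers (no nesting).

```shell
# Count the elements of a flat JSON array of numbers, e.g. '[0.1, 0.2, 0.3]'.
vector_len() {
  vl_body=$(printf '%s' "$1" | tr -d '[] \n\t')
  if [ -z "$vl_body" ]; then
    echo 0
  else
    # One comma-separated field per vector component
    printf '%s\n' "$vl_body" | awk -F',' '{ print NF }'
  fi
}

# $1 = dimension the collection expects, $2 = embedding as a JSON array
check_dimension() {
  cd_actual=$(vector_len "$2")
  if [ "$cd_actual" -eq "$1" ]; then
    echo "ok: $cd_actual dimensions"
  else
    echo "mismatch: collection expects $1, embedding has $cd_actual" >&2
    return 1
  fi
}
```

Pass the value from the `.result.config.params.vectors.size` query as the first argument and a sample embedding from the client as the second.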

Issue 6: Cluster Partition (Distributed Mode)

Symptoms:
- Some shards unavailable
- Partial search results
- "Shard not available" errors
Cause:
- Network partition between nodes
- Node crashed
- Replication lag

Resolution:

# Check cluster status
curl http://localhost:6333/cluster

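# Check peer status (peers are listed in the cluster response)
curl http://localhost:6333/cluster | jq '.result.peers'

# Check shard placement and state for the collection
curl http://localhost:6333/collections/agent_memory/cluster

# Try to recover the Raft state of the current peer
curl -X POST http://localhost:6333/cluster/recover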

# Remove unhealthy peer
curl -X DELETE http://localhost:6333/cluster/peer/{peer_id}

# Restart affected node
kubectl delete pod qdrant-{node} -n data

Restart Procedure

Graceful Restart (Recommended)

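# 1. Check for ongoing operations (GET /collections lists names only, so query each collection)
for c in $(curl -s http://localhost:6333/collections | jq -r '.result.collections[].name'); do
  echo "$c: $(curl -s "http://localhost:6333/collections/$c" | jq -r '.result.status')"
done

# 2. Create snapshot for safety (by default the call returns once the snapshot is written)
curl -X POST "http://localhost:6333/collections/agent_memory/snapshots"

# 3. Confirm the snapshot is listed
curl http://localhost:6333/collections/agent_memory/snapshots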

# 4. Perform rolling restart
kubectl rollout restart deployment/qdrant -n data

# 5. Wait for ready
kubectl wait --for=condition=ready pod -l app=qdrant -n data --timeout=180s

# 6. Verify health
curl http://localhost:6333/healthz
curl http://localhost:6333/readyz

Emergency Restart

# Force restart
kubectl delete pod qdrant-0 -n data --force

# Wait for recovery
kubectl wait --for=condition=ready pod qdrant-0 -n data --timeout=180s

# Verify collection integrity
curl http://localhost:6333/collections/agent_memory | jq '.result.status'

# If corrupted, restore from snapshot
curl -X PUT "http://localhost:6333/collections/agent_memory/snapshots/recover" \
  -H "Content-Type: application/json" \
  -d '{"location": "file:///qdrant/storage/snapshots/agent_memory-SNAPSHOT_ID.snapshot"}'

Local Development Restart

# Docker
docker restart qdrant

# OrbStack
orb restart qdrant

# Using docker-compose
docker-compose restart qdrant

Logs Location

Kubernetes Logs

# Qdrant logs
kubectl logs -f deployment/qdrant -n data

# Filter for errors
kubectl logs deployment/qdrant -n data | grep -E "ERROR|WARN"

# Export logs
kubectl logs deployment/qdrant -n data > qdrant-logs-$(date +%Y%m%d).txt

Inside Container

# Check Qdrant storage directory
kubectl exec -it qdrant-0 -n data -- ls -la /qdrant/storage

# Check WAL files
kubectl exec -it qdrant-0 -n data -- ls -la /qdrant/storage/collections/agent_memory/0/wal/

Telemetry

# Get telemetry data
curl http://localhost:6333/telemetry

# Get detailed metrics
curl http://localhost:6333/metrics

Scaling

Vertical Scaling

# Increase resources for larger collections
kubectl set resources deployment/qdrant -n data \
  --limits=cpu=8000m,memory=32Gi \
  --requests=cpu=2000m,memory=8Gi

Horizontal Scaling (Distributed Mode)

# Scale to multiple nodes (requires cluster configuration)
kubectl scale statefulset/qdrant -n data --replicas=3

# Configure sharding (shard_number is set when a collection is created, so run this for a new collection)
curl -X PUT http://localhost:6333/collections/agent_memory \
  -H "Content-Type: application/json" \
  -d '{
    "vectors": {
      "size": 1536,
      "distance": "Cosine"
    },
    "shard_number": 3,
    "replication_factor": 2
  }'

Storage Scaling

# Expand PVC
kubectl patch pvc qdrant-storage -n data -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'

Scaling Guidelines

Metric	Threshold	Action
Memory Usage	> 80%	Increase memory, enable memmap
CPU Usage	> 70%	Add replicas, increase CPU
Search Latency P99	> 500ms	Optimize index, add shards
Disk Usage	> 80%	Expand storage
Points Count	> 10M	Add shards, enable on-disk
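
The table above can be encoded as a small lookup for use in scripts or chat-ops checks. This is an illustrative sketch only: the function name `scaling_action` and the metric keys are assumptions introduced here, while the thresholds and actions are copied from the table.

```shell
# $1 = metric (memory_pct | cpu_pct | p99_ms | disk_pct | points), $2 = integer value
scaling_action() {
  case "$1" in
    memory_pct) sa_limit=80;       sa_action="Increase memory, enable memmap" ;;
    cpu_pct)    sa_limit=70;       sa_action="Add replicas, increase CPU" ;;
    p99_ms)     sa_limit=500;      sa_action="Optimize index, add shards" ;;
    disk_pct)   sa_limit=80;       sa_action="Expand storage" ;;
    points)     sa_limit=10000000; sa_action="Add shards, enable on-disk" ;;
    *) echo "unknown metric: $1" >&2; return 2 ;;
  esac
  # Thresholds are strict: a value equal to the limit needs no action
  if [ "$2" -gt "$sa_limit" ]; then
    echo "$sa_action"
  else
    echo "No action"
  fi
}
```

For example, `scaling_action memory_pct 85` prints `Increase memory, enable memmap`, and `scaling_action cpu_pct 70` prints `No action`.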

Alerts

Critical Alerts (PagerDuty)

Alert	Condition	Runbook Action
QdrantDown	Cannot connect for 2min	Emergency Restart
CollectionUnavailable	Collection status "red"	Recover from snapshot
DiskFull	Disk > 95%	Expand storage, cleanup

Warning Alerts (Slack)

Alert	Condition	Runbook Action
HighMemory	Memory > 80%	Enable memmap, increase limit
SlowSearch	P99 > 500ms	Optimize index
IndexingStuck	Optimizing > 1hr	Check resources
ShardUnhealthy	Shard status != green	Recover shard

Prometheus Alert Rules

groups:
  - name: qdrant
    rules:
      - alert: QdrantDown
        expr: up{job="qdrant"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Qdrant is down"
          runbook_url: "https://gitlab.com/blueflyio/agent-platform/technical-docs/-/wikis/runbooks/qdrant"

      - alert: QdrantHighMemory
        expr: qdrant_memory_usage_bytes / qdrant_memory_limit_bytes > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Qdrant memory usage high"

      - alert: QdrantSlowSearch
        expr: histogram_quantile(0.99, rate(qdrant_search_duration_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Qdrant search latency high"

Monitoring Dashboards

Grafana: https://grafana.local/d/qdrant
Qdrant Dashboard: http://localhost:6333/dashboard (built-in)
Prometheus: https://prometheus.local/graph?g0.expr=up{job="qdrant"}

Backup & Recovery

Create Snapshot

# Snapshot specific collection
curl -X POST "http://localhost:6333/collections/agent_memory/snapshots"

# List snapshots
curl http://localhost:6333/collections/agent_memory/snapshots

# Download snapshot
curl -O http://localhost:6333/collections/agent_memory/snapshots/{snapshot_name}

Restore from Snapshot

# Restore collection from snapshot
curl -X PUT "http://localhost:6333/collections/agent_memory/snapshots/recover" \
  -H "Content-Type: application/json" \
  -d '{
    "location": "file:///qdrant/storage/snapshots/agent_memory-SNAPSHOT.snapshot"
  }'

# Or from URL
curl -X PUT "http://localhost:6333/collections/agent_memory/snapshots/recover" \
  -H "Content-Type: application/json" \
  -d '{
    "location": "https://backup-storage/snapshots/agent_memory.snapshot"
  }'
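
When restoring under pressure, building the request body by hand invites typos in the path. The sketch below derives the body from a snapshot file name. `recover_body`, `recover_collection`, `QDRANT_URL`, and `SNAPSHOT_DIR` are names introduced here, and the default directory is the one this runbook uses, which may differ in your deployment.

```shell
QDRANT_URL="${QDRANT_URL:-http://localhost:6333}"
# Assumption: snapshots live in the directory used elsewhere in this runbook.
SNAPSHOT_DIR="${SNAPSHOT_DIR:-/qdrant/storage/snapshots}"

# Build the JSON body for the recover call from a snapshot file name.
recover_body() {
  printf '{"location": "file://%s/%s"}' "$SNAPSHOT_DIR" "$1"
}

# Restore a collection from a local snapshot file.
# $1 = collection, $2 = snapshot file name
recover_collection() {
  curl -sf -X PUT "$QDRANT_URL/collections/$1/snapshots/recover" \
    -H "Content-Type: application/json" \
    -d "$(recover_body "$2")"
}
```

Take the file name from the snapshot list call, then run `recover_collection agent_memory <file name>`.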

Useful API Endpoints

# Health check
curl http://localhost:6333/healthz
curl http://localhost:6333/readyz

# Collection info
curl http://localhost:6333/collections
curl http://localhost:6333/collections/{name}
curl -X POST http://localhost:6333/collections/{name}/points/count \
  -H "Content-Type: application/json" \
  -d '{"exact": true}'

# Search example
curl -X POST http://localhost:6333/collections/agent_memory/points/search \
  -H "Content-Type: application/json" \
  -d '{
    "vector": [0.1, 0.2, ...],
    "limit": 10
  }'

# Cluster info (distributed mode)
curl http://localhost:6333/cluster
curl http://localhost:6333/cluster/peers

# Metrics (Prometheus format)
curl http://localhost:6333/metrics
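
Issue 2 recommends reducing search scope with filters. A search body with a payload filter might look like the sketch below. The helper name `filtered_search_body` and the payload field `agent_id` are hypothetical, and the helper does no JSON escaping, so it suits simple keys and values only.

```shell
# Build a search body that restricts results to points whose payload matches key = value.
# $1 = query vector as a JSON array, $2 = payload key, $3 = payload value (string), $4 = limit (default 10)
filtered_search_body() {
  printf '{"vector": %s, "filter": {"must": [{"key": "%s", "match": {"value": "%s"}}]}, "limit": %s}' \
    "$1" "$2" "$3" "${4:-10}"
}
```

The output can be passed to the search endpoint with `-d "$(filtered_search_body "$VECTOR_JSON" agent_id agent-42 5)"`, where `VECTOR_JSON` holds the query embedding. Creating a payload index on the filtered field usually makes such filters cheaper.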

Contacts

On-call: PagerDuty rotation
Slack: #platform-incidents
Owner: AI/ML Team

Related Runbooks

Agent Brain Runbook - Primary consumer
PostgreSQL Runbook - Structured data
Redis Runbook - Caching layer
