Oracle VM Disaster Recovery
Last Updated: 2026-02-15
Owner: Infrastructure Team
Review Cycle: Monthly
Overview
This document outlines the disaster recovery strategy for the Oracle VM infrastructure running at 100.103.48.75 (oracle.tailcf98b3.ts.net).
Critical Risk: If the Oracle VM disappeared today, service code would survive, but production data, configs, and orchestration would be lost:
- ✅ Services code: SAFE (8 git repos can be redeployed)
- ❌ Production data: LOST (all databases, vector stores, logs)
- ❌ Production configs: LOST (main .env + 11 service-specific configs)
- ❌ Service orchestration: LOST (main docker-compose.yml not in git)
What Would Be Lost (NOT in Git)
Production Environment Files
CRITICAL - Contains all secrets and production configs
| File | Contents | Impact |
|---|---|---|
| /opt/bluefly/.env | Main env for docker-compose (POSTGRES_PASSWORD, JWT_SECRET, API keys) | CRITICAL - All services fail |
| /opt/bluefly/docker-compose.yml | Service orchestration, network config, volume mappings | CRITICAL - Cannot recreate infrastructure |
Service-Specific .env Files (11 files):
- /opt/bluefly/agent-protocol/.env.production
- /opt/bluefly/agent-chat/infrastructure/.env.production
- /opt/bluefly/agent-tracer/.env.{observability,k8s,analytics}
- /opt/bluefly/compliance-engine/.env.{phoenix,unified-auth}
- /opt/bluefly/agent-router/.env.{litellm,phoenix,vastai}
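Because these files exist only on the VM, one way to capture them is a single archive taken alongside the database backups. The following is a minimal sketch, assuming bash and GNU tar; `BLUEFLY_ROOT`, `config_paths`, and `archive_configs` are illustrative names, not existing tooling. The path list mirrors the files named above (the brace patterns expand to ten service-specific files, so add any that the list omits), and missing files are skipped with a warning rather than aborting the run.

```shell
#!/usr/bin/env bash
# Sketch: collect the config files that are not in git into one archive.
# BLUEFLY_ROOT and the function names are illustrative, not existing tooling.
set -euo pipefail

BLUEFLY_ROOT="${BLUEFLY_ROOT:-/opt/bluefly}"

# Print every config path named in this document, relative to BLUEFLY_ROOT.
config_paths() {
  printf '%s\n' \
    .env \
    docker-compose.yml \
    agent-protocol/.env.production \
    agent-chat/infrastructure/.env.production \
    agent-tracer/.env.{observability,k8s,analytics} \
    compliance-engine/.env.{phoenix,unified-auth} \
    agent-router/.env.{litellm,phoenix,vastai}
}

# Archive the paths that exist; warn on stderr about the ones that do not.
archive_configs() {
  local out="$1" path
  local -a present=()
  while IFS= read -r path; do
    if [ -f "$BLUEFLY_ROOT/$path" ]; then
      present+=("$path")
    else
      echo "warning: missing $BLUEFLY_ROOT/$path" >&2
    fi
  done < <(config_paths)
  if [ "${#present[@]}" -eq 0 ]; then
    echo "error: no config files found under $BLUEFLY_ROOT" >&2
    return 1
  fi
  tar -czf "$out" -C "$BLUEFLY_ROOT" "${present[@]}"
}
```

A call such as `archive_configs /opt/bluefly/backups/configs-$(date +%Y%m%d).tar.gz` would produce the archive. It contains every production secret, so it should be encrypted before it leaves the VM.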
Production Data Volumes
CRITICAL - All persistent data would be permanently lost
| Volume Path | Service | Data Type | Recovery |
|---|---|---|---|
| /opt/bluefly/data/postgres | PostgreSQL | All application databases | NONE (unless backed up) |
| /opt/bluefly/data/mongodb | MongoDB | LibreChat conversations, n8n workflows | NONE |
| /opt/bluefly/data/qdrant | Qdrant | Vector embeddings, semantic search indexes | NONE (re-index required) |
| /opt/bluefly/data/redis | Redis | Cache, session data, job queues | ACCEPTABLE (ephemeral) |
| /opt/bluefly/data/dragonfly-db | Dragonfly | Testing artifacts, compliance reports | MODERATE |
| /opt/bluefly/data/grafana | Grafana | Dashboards, alerts, data sources | MODERATE (can recreate) |
| /opt/bluefly/data/{loki,tempo,phoenix} | Observability | Logs, traces | LOW (historical only) |
| /opt/bluefly/data/agents | Agent state | Agent runtime state, task queues | HIGH |
What Would Survive (In GitLab)
Service Code (Recoverable)
All 8 core services define an oracle-deploy job in their .gitlab-ci.yml:
- ✅ agent-router
- ✅ agent-mesh
- ✅ agent-protocol
- ✅ workflow-engine
- ✅ agent-tracer
- ✅ compliance-engine
- ✅ content-guardian
- ✅ dragonfly
Recovery: All 8 services can be redeployed via CI/CD to a new Oracle VM.
Recovery Procedures
Full Disaster Recovery (Estimated: 4.5 hours)
Prerequisites:
- Backups exist in a secure location
- GitLab access intact
- New Oracle VM provisioned
Step 1: Restore Configs (30 minutes)
    # SSH to new Oracle VM
    ssh new-oracle

    # Create directory structure (on the VM)
    sudo mkdir -p /opt/bluefly/{data,backups}
    sudo chown -R ubuntu:ubuntu /opt/bluefly

    # Restore docker-compose.yml and .env (run from the machine that holds the backups)
    scp ~/backups/oracle-vm/compose/docker-compose.yml new-oracle:/opt/bluefly/
    scp ~/backups/oracle-vm/configs/.env.production new-oracle:/opt/bluefly/.env
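Before moving on to the databases, it may help to confirm that both files actually arrived and that the secrets file is not world-readable. A minimal sketch, assuming bash on the new VM; `BLUEFLY_ROOT` and `check_restored_configs` are hypothetical names introduced for illustration.

```shell
#!/usr/bin/env bash
# Sketch: confirm the restored files exist and restrict access to .env.
# BLUEFLY_ROOT and check_restored_configs are illustrative names.
set -euo pipefail

BLUEFLY_ROOT="${BLUEFLY_ROOT:-/opt/bluefly}"

check_restored_configs() {
  local missing=0 f
  for f in docker-compose.yml .env; do
    # -s is true only for files that exist and are not empty.
    if [ ! -s "$BLUEFLY_ROOT/$f" ]; then
      echo "missing or empty: $BLUEFLY_ROOT/$f" >&2
      missing=1
    fi
  done
  if [ "$missing" -ne 0 ]; then
    return 1
  fi
  # .env holds POSTGRES_PASSWORD, JWT_SECRET and API keys: owner-only access.
  chmod 600 "$BLUEFLY_ROOT/.env"
}
```

Running `check_restored_configs` after the two `scp` commands returns non-zero if either file is absent, which makes it usable as a gate in a longer recovery script.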
Step 2: Restore Databases (1 hour)
    cd /opt/bluefly
    docker-compose up -d postgres redis mongodb qdrant

    # Restore Postgres (daily dumps are gzip-compressed)
    gunzip -c ~/backups/oracle-vm/postgres-YYYYMMDD.sql.gz | docker exec -i postgres psql -U bluefly

    # Restore MongoDB (the archive is written with --gzip)
    docker exec -i mongodb mongorestore --archive --gzip < ~/backups/oracle-vm/mongodb-YYYYMMDD.archive.gz

    # Restore Qdrant (-i attaches stdin so tar can read the archive)
    docker exec -i qdrant tar xzf - -C / < ~/backups/oracle-vm/qdrant-YYYYMMDD.tar.gz
    docker-compose restart qdrant
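The daily backup commands in this document write gzip-compressed files, so a cheap integrity check is possible before any restore is attempted. A minimal sketch, assuming bash and gzip; `verify_backup` is a hypothetical helper, and it only proves that the gzip stream is intact, not that the dump inside is complete.

```shell
#!/usr/bin/env bash
# Sketch: refuse to restore from a backup that is missing, empty, or corrupt.
# verify_backup is an illustrative helper, not an existing script.
set -euo pipefail

verify_backup() {
  local file="$1"
  if [ ! -s "$file" ]; then
    echo "missing or empty: $file" >&2
    return 1
  fi
  # gzip -t reads the whole stream and checks it without writing any output.
  if ! gzip -t "$file" 2>/dev/null; then
    echo "corrupt gzip stream: $file" >&2
    return 1
  fi
  echo "ok: $file"
}
```

For example, `verify_backup ~/backups/oracle-vm/postgres-YYYYMMDD.sql.gz` (with the real date substituted) can be run for each of the three files before Step 2 starts.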
Step 3: Deploy Services via CI/CD (2 hours)
    # Start a pipeline on main for each service via the GitLab UI or glab.
    # If oracle-deploy is a manual job, start it from the pipeline page.
    glab ci run --repo blueflyio/agent-platform/agent-router --branch main

    # Repeat for all 8 services
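Repeating the command by hand for eight repositories is easy to get wrong under pressure, so a loop may be preferable. The sketch below uses glab's `--repo` and `--branch` options for `glab ci run`. It assumes that every service lives under `blueflyio/agent-platform/`, which this document confirms only for agent-router, and that starting a pipeline on main is enough to reach the deploy job. `GROUP`, `SERVICES`, `DRY_RUN`, and `deploy_all` are illustrative names.

```shell
#!/usr/bin/env bash
# Sketch: start a pipeline on main for each of the eight services.
# ASSUMPTION: every repo lives under blueflyio/agent-platform/; only
# agent-router's path appears in this document.
set -euo pipefail

GROUP="${GROUP:-blueflyio/agent-platform}"
SERVICES=(agent-router agent-mesh agent-protocol workflow-engine
          agent-tracer compliance-engine content-guardian dragonfly)

deploy_all() {
  local svc
  for svc in "${SERVICES[@]}"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      # Default: print the command instead of running it.
      echo "glab ci run --repo $GROUP/$svc --branch main"
    else
      glab ci run --repo "$GROUP/$svc" --branch main
    fi
  done
}
```

Calling `deploy_all` prints the eight commands for review; `DRY_RUN=0 deploy_all` runs them. Defaulting to a dry run is a deliberate choice here, since a mistyped group path would otherwise fail eight times in a row.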
Step 4: Start Third-Party Services (30 minutes)
    docker-compose up -d librechat langflow n8n grafana loki tempo phoenix
Step 5: Verify (30 minutes)
    docker ps --format 'table {{.Names}}\t{{.Status}}'
    curl http://localhost:4000/health  # agent-router
    curl http://localhost:3005/health  # agent-mesh
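Services rarely answer on the first request after a fresh deploy, so a single `curl` can report a false failure. The sketch below polls the two endpoints from Step 5 with retries. It assumes bash and curl; `wait_for_health` and `verify_core` are hypothetical names, and the attempt count and delay are arbitrary defaults rather than tuned values.

```shell
#!/usr/bin/env bash
# Sketch: poll the two health endpoints from Step 5 until they answer.
# wait_for_health and verify_core are illustrative names.
set -euo pipefail

# Usage: wait_for_health <name> <url> [attempts] [delay_seconds]
wait_for_health() {
  local name="$1" url="$2" attempts="${3:-10}" delay="${4:-3}" i
  for ((i = 1; i <= attempts; i++)); do
    # -f makes curl exit non-zero on HTTP 4xx/5xx responses.
    if curl -fs --max-time 5 -o /dev/null "$url"; then
      echo "healthy: $name"
      return 0
    fi
    sleep "$delay"
  done
  echo "unhealthy: $name ($url)" >&2
  return 1
}

# Check both services and report failure if either one never responds.
verify_core() {
  local failed=0
  wait_for_health agent-router http://localhost:4000/health "$@" || failed=1
  wait_for_health agent-mesh   http://localhost:3005/health "$@" || failed=1
  return "$failed"
}
```

`verify_core` uses the defaults; `verify_core 20 5` would allow 20 attempts five seconds apart. Both services are always checked, so one failure does not hide the other.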
Backup Strategy
Daily Automated Backups
Databases:
    # Written to /opt/bluefly/backups so the NAS sync picks them up

    # Postgres
    docker exec postgres pg_dumpall -U bluefly | gzip > /opt/bluefly/backups/postgres-$(date +%Y%m%d).sql.gz

    # MongoDB
    docker exec mongodb mongodump --archive --gzip > /opt/bluefly/backups/mongodb-$(date +%Y%m%d).archive.gz

    # Qdrant
    docker exec qdrant tar czf - /qdrant/storage > /opt/bluefly/backups/qdrant-$(date +%Y%m%d).tar.gz
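A shell redirect creates the output file even when the dump command fails, which can leave a truncated file that looks like a valid backup. One common guard is to write to a temporary file and rename it only on success. The sketch below assumes bash with `pipefail`; `dump_to` is a hypothetical helper, and it gzips the command's stdout itself, so commands that already compress their output would need that option removed.

```shell
#!/usr/bin/env bash
# Sketch: write each dump to a temporary file and rename it only on success,
# so a failed dump never leaves a truncated file that looks like a backup.
# dump_to is an illustrative helper.
set -euo pipefail

# Usage: dump_to <output.gz> <command> [args...]
# Runs the command, gzips its stdout, and publishes the file on success.
dump_to() {
  local out="$1" tmp
  shift
  # Same directory as the target, so the final mv is a rename.
  tmp="$(mktemp "${out}.XXXXXX")"
  # With pipefail set, the pipeline fails if the dump command fails.
  if "$@" | gzip > "$tmp"; then
    mv "$tmp" "$out"
  else
    rm -f "$tmp"
    echo "backup failed: $out" >&2
    return 1
  fi
}
```

For Postgres this would be called as `dump_to /opt/bluefly/backups/postgres-$(date +%Y%m%d).sql.gz docker exec postgres pg_dumpall -U bluefly`. A non-zero return gives cron or a monitoring hook something concrete to alert on.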
Retention:
- Daily backups: 7 days
- Weekly backups: 30 days
- Monthly backups: 12 months
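The retention windows above can be enforced with `find`. The sketch below assumes GNU find and a layout in which backups are sorted into `daily/`, `weekly/`, and `monthly/` subdirectories; that layout is hypothetical, as are the names `prune_tier` and `prune_backups`. Twelve months is approximated as 365 days.

```shell
#!/usr/bin/env bash
# Sketch: apply the retention windows with find(1).
# ASSUMPTION: backups are sorted into daily/, weekly/ and monthly/
# subdirectories; this layout is hypothetical.
set -euo pipefail

# Usage: prune_tier <directory> <max_age_days>
prune_tier() {
  local dir="$1" days="$2"
  if [ ! -d "$dir" ]; then
    return 0
  fi
  # -mtime +N matches files whose age in whole days is greater than N.
  find "$dir" -maxdepth 1 -type f -name '*.gz' -mtime +"$days" -print -delete
}

# Usage: prune_backups [backup_root]
prune_backups() {
  local root="${1:-/opt/bluefly/backups}"
  prune_tier "$root/daily"   7
  prune_tier "$root/weekly"  30
  prune_tier "$root/monthly" 365   # 12 months, approximated as 365 days
}
```

Each deleted path is printed, so the output of `prune_backups` can be appended to a log. It is safer to prune only after the NAS sync has succeeded, so that a broken sync never coincides with deleted local copies.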
Sync to NAS:
    rsync -avz /opt/bluefly/backups/ blueflynas:/volume1/backups/oracle-vm/
Currently Running Services
| Service | Status | Health |
|---|---|---|
| Core Platform | | |
| agent-router | Running | ✅ Healthy |
| agent-mesh | Running | ✅ Healthy |
| agent-protocol | Running | ✅ Healthy |
| workflow-engine | Running | ✅ Healthy |
| compliance-engine | Running | ✅ Healthy |
| agent-tracer | Running | ✅ Healthy |
| dragonfly | Running | ✅ Healthy |
| Infrastructure | | |
| postgres | Running | ✅ Healthy |
| redis | Running | ✅ Healthy |
| qdrant | Running | ✅ Healthy |
| mongodb | Running | ✅ Healthy |
| Observability | | |
| grafana | Running | ✅ Healthy |
| loki | Running | ⚠️ No health check |
| tempo | Running | ⚠️ No health check |
| phoenix | Running | ❌ Unhealthy |
| otel-collector | Running | ❌ Unhealthy |
Configuration Management
Tracked in Git
Location: gitlab_components/infrastructure/oracle-vm/
- docker-compose.production.yml - Service orchestration
- .env.template - Environment variable template (NO secrets)
Secret Management
Location: Secure encrypted backup (NOT in git)
- Main .env file
- Service-specific .env files
- Database passwords
- API keys
Monitoring & Alerts
Health Checks
All services must have health checks defined in docker-compose.yml:
    healthcheck:
      test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:4000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
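To find which running containers still lack a health check, `docker inspect` can be queried with a Go template. A minimal sketch, assuming bash, awk, and a Docker CLI on the VM; `containers_without_healthcheck` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch: list running containers that have no health check configured.
# containers_without_healthcheck is an illustrative name.
set -euo pipefail

containers_without_healthcheck() {
  local ids
  ids="$(docker ps -q)"
  if [ -z "$ids" ]; then
    return 0
  fi
  # .State.Health is absent when the container defines no health check,
  # so the template prints "none" for those containers.
  # shellcheck disable=SC2086  # word splitting of $ids is intended
  docker inspect \
    --format '{{.Name}} {{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}}' \
    $ids | awk '$2 == "none" { sub(/^\//, "", $1); print $1 }'
}
```

The function prints one container name per line, with the leading slash that Docker adds to names removed. An empty result means every running container has a health check defined.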
Services Needing Fixes
- ❌ phoenix - LLM observability unhealthy
- ❌ otel-collector - OpenTelemetry collection failing
- ❌ langflow - AI workflow builder unhealthy
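When a container reports unhealthy, Docker keeps the most recent probe results, which is usually the fastest way to see why a check fails. A minimal sketch, assuming the container names match the service names above (not confirmed by this document); `health_report` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch: print the recorded health state for each named container.
# ASSUMPTION: container names match the service names in this document.
set -euo pipefail

# Usage: health_report <container> [container...]
health_report() {
  local name
  for name in "$@"; do
    echo "== $name =="
    # Prints status, failing streak and recent probe output as JSON,
    # or "null" when the container has no health check.
    docker inspect --format '{{json .State.Health}}' "$name" \
      || echo "inspect failed for $name" >&2
  done
}
```

For the three services listed above, this would be run as `health_report phoenix otel-collector langflow`.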
Next Review: 2026-03-15