Oracle VM Disaster Recovery

Last Updated: 2026-02-15
Owner: Infrastructure Team
Review Cycle: Monthly

Overview

This document outlines the disaster recovery strategy for the Oracle VM infrastructure running at 100.103.48.75 (oracle.tailcf98b3.ts.net).

Critical Risk: If the Oracle VM disappeared today, the impact would be:

✅ Service code: SAFE (8 git repos can be redeployed)
❌ Production data: LOST (all databases, vector stores, logs)
❌ Production configs: LOST (main .env + 11 service-specific configs)
❌ Service orchestration: LOST (main docker-compose.yml not in git)

What Would Be Lost (NOT in Git)

Production Environment Files

CRITICAL - Contains all secrets and production configs

File	Contents	Impact
`/opt/bluefly/.env`	Main env for docker-compose (POSTGRES_PASSWORD, JWT_SECRET, API keys)	CRITICAL - All services fail
`/opt/bluefly/docker-compose.yml`	Service orchestration, network config, volume mappings	CRITICAL - Cannot recreate infrastructure

Service-Specific .env Files (11 files):

/opt/bluefly/agent-protocol/.env.production
/opt/bluefly/agent-chat/infrastructure/.env.production
/opt/bluefly/agent-tracer/.env.{observability,k8s,analytics}
/opt/bluefly/compliance-engine/.env.{phoenix,unified-auth}
/opt/bluefly/agent-router/.env.{litellm,phoenix,vastai}
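
Because none of these files are in git, one way to capture them together is a single owner-only tarball. The sketch below is a minimal example, not the team's actual backup job: the function name `collect_configs` is hypothetical, the file list simply mirrors the paths above, and the output path must be absolute because the function changes into the root directory. Encrypting the resulting archive is left to whatever tool the team already uses.

```shell
# Sketch: archive the untracked config files into one tarball.
# collect_configs is a hypothetical helper; ROOT and OUT are parameters.
collect_configs() (
    root=$1
    out=$2                         # must be an absolute path
    umask 077                      # archive holds secrets: owner-only permissions
    cd "$root" || exit 1
    found=""
    for f in .env docker-compose.yml \
        agent-protocol/.env.production \
        agent-chat/infrastructure/.env.production \
        agent-tracer/.env.observability agent-tracer/.env.k8s agent-tracer/.env.analytics \
        compliance-engine/.env.phoenix compliance-engine/.env.unified-auth \
        agent-router/.env.litellm agent-router/.env.phoenix agent-router/.env.vastai
    do
        if [ -f "$f" ]; then
            found="$found $f"
        else
            # Report missing files instead of failing the whole archive.
            echo "missing: $root/$f" >&2
        fi
    done
    [ -n "$found" ] || { echo "no config files found under $root" >&2; exit 1; }
    # Word splitting is intentional: the paths above contain no spaces.
    tar -czf "$out" $found
)
```

Example call: `collect_configs /opt/bluefly /opt/bluefly/backups/configs-$(date +%Y%m%d).tar.gz`.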

Production Data Volumes

CRITICAL - All persistent data would be permanently lost

Volume Path	Service	Data Type	Recovery
`/opt/bluefly/data/postgres`	PostgreSQL	All application databases	NONE (unless backed up)
`/opt/bluefly/data/mongodb`	MongoDB	LibreChat conversations, n8n workflows	NONE
`/opt/bluefly/data/qdrant`	Qdrant	Vector embeddings, semantic search indexes	NONE (re-index required)
`/opt/bluefly/data/redis`	Redis	Cache, session data, job queues	ACCEPTABLE (ephemeral)
`/opt/bluefly/data/dragonfly-db`	Dragonfly	Testing artifacts, compliance reports	MODERATE
`/opt/bluefly/data/grafana`	Grafana	Dashboards, alerts, data sources	MODERATE (can recreate)
`/opt/bluefly/data/{loki,tempo,phoenix}`	Observability	Logs, traces	LOW (historical only)
`/opt/bluefly/data/agents`	Agent state	Agent runtime state, task queues	HIGH

What Would Survive (In GitLab)

Service Code (Recoverable)

All 8 core services define an oracle-deploy job in their .gitlab-ci.yml:

✅ agent-router
✅ agent-mesh
✅ agent-protocol
✅ workflow-engine
✅ agent-tracer
✅ compliance-engine
✅ content-guardian
✅ dragonfly

Recovery: All 8 services can be redeployed to a new Oracle VM via CI/CD.

Recovery Procedures

Full Disaster Recovery (Estimated: 4.5 hours)

Prerequisites:

Backups exist in a secure location and are reachable
GitLab access is intact
A new Oracle VM is provisioned

Step 1: Restore Configs (30 minutes)

# On the new Oracle VM: create the directory structure
ssh new-oracle
sudo mkdir -p /opt/bluefly/{data,backups}
sudo chown -R ubuntu:ubuntu /opt/bluefly
exit

# From the machine that holds the backups: restore docker-compose.yml and .env
scp ~/backups/oracle-vm/compose/docker-compose.yml new-oracle:/opt/bluefly/
scp ~/backups/oracle-vm/configs/.env.production new-oracle:/opt/bluefly/.env
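
Before starting any containers, it can help to confirm that the restored .env actually defines the keys the stack depends on. The following is a minimal sketch; `check_env_keys` is a hypothetical helper, and the key names in the usage note are the two named in the environment file table (POSTGRES_PASSWORD, JWT_SECRET).

```shell
# Sketch: confirm a restored .env defines the keys the stack depends on.
# Prints each missing or empty key and returns non-zero if any is absent.
check_env_keys() (
    env_file=$1
    shift
    status=0
    for key in "$@"; do
        # Match KEY=<at least one character> at the start of a line.
        if ! grep -Eq "^${key}=.+" "$env_file"; then
            echo "missing or empty: $key" >&2
            status=1
        fi
    done
    exit "$status"
)
```

Example call on the new VM: `check_env_keys /opt/bluefly/.env POSTGRES_PASSWORD JWT_SECRET`.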

Step 2: Restore Databases (1 hour)

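# Run on the new VM. Assumes the dump files were copied to ~/backups/oracle-vm/ there.
cd /opt/bluefly
docker-compose up -d postgres redis mongodb qdrant

# Restore Postgres (the dump is gzip-compressed pg_dumpall output)
gunzip -c ~/backups/oracle-vm/postgres-YYYYMMDD.sql.gz | docker exec -i postgres psql -U bluefly

# Restore MongoDB (the archive was written with --gzip)
docker exec -i mongodb mongorestore --archive --gzip < ~/backups/oracle-vm/mongodb-YYYYMMDD.archive.gz

# Restore Qdrant (-i is required so tar can read the archive from stdin)
docker exec -i qdrant tar xzf - -C / < ~/backups/oracle-vm/qdrant-YYYYMMDD.tar.gz
docker-compose restart qdrant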
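
The YYYYMMDD placeholder has to be resolved to a real file during a restore. As a minimal sketch, the hypothetical helper `latest_backup` below picks the newest file for a given prefix, relying only on the date stamp in the file name sorting lexicographically; it assumes the naming scheme used by the daily backup commands (for example `postgres-20260214.sql.gz`).

```shell
# Sketch: print the newest backup for a prefix, relying on the
# YYYYMMDD stamp in the file name sorting lexicographically.
latest_backup() (
    dir=$1
    prefix=$2
    # Only names of the form <prefix>-<8 digits>.<ext> are considered.
    newest=$(ls "$dir"/"$prefix"-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9].* 2>/dev/null \
        | LC_ALL=C sort | tail -n 1)
    # Fail loudly when no backup exists rather than printing nothing.
    [ -n "$newest" ] || exit 1
    printf '%s\n' "$newest"
)
```

Example: `gunzip -c "$(latest_backup ~/backups/oracle-vm postgres)" | docker exec -i postgres psql -U bluefly`.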

Step 3: Deploy Services via CI/CD (2 hours)

# Start a pipeline on main for each service (via the GitLab UI or glab).
# If the oracle-deploy job is manual, start it from the pipeline page afterwards.
glab ci run --repo blueflyio/agent-platform/agent-router --branch main
# Repeat for all 8 services
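
Repeating the trigger by hand for eight repositories is error-prone, so a loop is a natural fit. The sketch below defaults to a dry run that only prints the commands. It makes two assumptions that should be checked first: that the other seven repositories live under the same `blueflyio/agent-platform` group as agent-router, and that starting a pipeline with `glab ci run` (using its `--repo` and `--branch` options) is enough to reach the deploy job in each project.

```shell
# Sketch: start a pipeline on main for each of the eight services.
# GROUP comes from the agent-router path; that the other seven repos
# share this group is an assumption.
GROUP=${GROUP:-blueflyio/agent-platform}
SERVICES="agent-router agent-mesh agent-protocol workflow-engine agent-tracer compliance-engine content-guardian dragonfly"

trigger_deploys() {
    for svc in $SERVICES; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Default: only print what would run.
            echo "glab ci run --repo $GROUP/$svc --branch main"
        else
            # Keep going on failure so one broken repo does not block the rest.
            glab ci run --repo "$GROUP/$svc" --branch main || echo "failed: $svc" >&2
        fi
    done
}
```

Run `trigger_deploys` to review the commands, then `DRY_RUN=0 trigger_deploys` to execute them.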

Step 4: Start Third-Party Services (30 minutes)

docker-compose up -d librechat langflow n8n grafana loki tempo phoenix

Step 5: Verify (30 minutes)

docker ps --format 'table {{.Names}}\t{{.Status}}'
curl http://localhost:4000/health  # agent-router
curl http://localhost:3005/health  # agent-mesh
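
Services often need some time after startup before /health responds, so a single curl can report a false failure. A minimal polling sketch follows; `wait_healthy` and the `PROBE` override are hypothetical names, and only the two ports shown above (4000 and 3005) come from this document.

```shell
# Sketch: poll a health endpoint until it answers or retries run out.
# PROBE is overridable so the loop can be exercised without a live service.
wait_healthy() (
    url=$1
    retries=${2:-10}
    delay=${3:-5}
    probe=${PROBE:-curl -fsS -o /dev/null --max-time 5}
    attempt=1
    while :; do
        # Word splitting of $probe is intentional (command plus flags).
        if $probe "$url" 2>/dev/null; then
            echo "healthy: $url (attempt $attempt)"
            exit 0
        fi
        [ "$attempt" -ge "$retries" ] && break
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    echo "unhealthy: $url after $retries attempts" >&2
    exit 1
)
```

Example: `wait_healthy http://localhost:4000/health && wait_healthy http://localhost:3005/health`.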

Backup Strategy

Daily Automated Backups

Databases:

# Write dumps to /opt/bluefly/backups, the directory that is synced to the NAS

# Postgres
docker exec postgres pg_dumpall -U bluefly | gzip > /opt/bluefly/backups/postgres-$(date +%Y%m%d).sql.gz

# MongoDB
docker exec mongodb mongodump --archive --gzip > /opt/bluefly/backups/mongodb-$(date +%Y%m%d).archive.gz

# Qdrant
docker exec qdrant tar czf - /qdrant/storage > /opt/bluefly/backups/qdrant-$(date +%Y%m%d).tar.gz
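
A pipeline such as `docker exec ... | gzip > file` exits with gzip's status, so a failed dump can still leave a small but valid archive on disk. The sketch below checks each file after the dump; `verify_backup` is a hypothetical helper, and it assumes every backup is a standard gzip stream, as the three commands above produce.

```shell
# Sketch: reject backups that are missing, corrupt, or empty inside.
verify_backup() (
    file=$1
    [ -f "$file" ] || { echo "missing: $file" >&2; exit 1; }
    gzip -t "$file" 2>/dev/null || { echo "corrupt gzip stream: $file" >&2; exit 1; }
    # A failed dump piped through gzip still yields a valid, empty archive,
    # so also confirm that at least one byte decompresses.
    bytes=$(gzip -dc "$file" | head -c 1 | wc -c | tr -d ' ')
    [ "$bytes" -gt 0 ] || { echo "empty payload: $file" >&2; exit 1; }
    echo "ok: $file"
)
```

Example: `verify_backup /opt/bluefly/backups/postgres-$(date +%Y%m%d).sql.gz || echo "backup failed" >&2`. This only proves the file is readable; a periodic test restore is still the real check.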

Retention:

Daily backups: 7 days
Weekly backups: 30 days
Monthly backups: 12 months
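
The retention windows above can be enforced with an age-based prune. This is a minimal sketch: `prune_older_than` is a hypothetical helper, and the per-tier subdirectories in the usage note are an assumed layout that this document does not specify.

```shell
# Sketch: delete backup files older than a given number of days.
# Prints each file it removes; a missing directory is not an error.
prune_older_than() (
    dir=$1
    days=$2
    [ -d "$dir" ] || exit 0
    # -mtime +N matches files whose age in whole days exceeds N.
    find "$dir" -maxdepth 1 -type f -mtime +"$days" -print -exec rm -f {} \;
)
```

With one directory per tier (assumed), the schedule maps to `prune_older_than /opt/bluefly/backups/daily 7`, `prune_older_than /opt/bluefly/backups/weekly 30`, and `prune_older_than /opt/bluefly/backups/monthly 365`.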

Sync to NAS:

rsync -avz /opt/bluefly/backups/ blueflynas:/volume1/backups/oracle-vm/

Currently Running Services

Service	Status	Health
Core Platform
agent-router	Running	✅ Healthy
agent-mesh	Running	✅ Healthy
agent-protocol	Running	✅ Healthy
workflow-engine	Running	✅ Healthy
compliance-engine	Running	✅ Healthy
agent-tracer	Running	✅ Healthy
dragonfly	Running	✅ Healthy
Infrastructure
postgres	Running	✅ Healthy
redis	Running	✅ Healthy
qdrant	Running	✅ Healthy
mongodb	Running	✅ Healthy
Observability
grafana	Running	✅ Healthy
loki	Running	⚠️ No health check
tempo	Running	⚠️ No health check
phoenix	Running	❌ Unhealthy
otel-collector	Running	❌ Unhealthy

Configuration Management

Tracked in Git

Location: gitlab_components/infrastructure/oracle-vm/

docker-compose.production.yml - Service orchestration
.env.template - Environment variable template (NO secrets)
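
One way to keep .env.template in step with production without leaking values is to derive it from the live file by blanking everything after the first `=`. The sketch below assumes simple single-line `KEY=value` entries; `make_env_template` is a hypothetical helper, and the output should still be reviewed before committing because comments are passed through unchanged.

```shell
# Sketch: derive a secret-free template from a live .env by blanking values.
# Comments and blank lines pass through unchanged; multi-line values are
# not handled.
make_env_template() {
    sed -E 's/^([[:space:]]*(export[[:space:]]+)?[A-Za-z_][A-Za-z0-9_]*)=.*/\1=/' "$1"
}
```

Example: `make_env_template /opt/bluefly/.env > .env.template`.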

Secret Management

Location: Secure encrypted backup (NOT in git)

Main .env file
Service-specific .env files
Database passwords
API keys

Monitoring & Alerts

Health Checks

All services must have health checks defined in docker-compose.yml. Example for a service that exposes /health on port 4000 (agent-router):

healthcheck:
  test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:4000/health"]
  interval: 30s
  timeout: 10s
  retries: 3

Services Needing Fixes

❌ phoenix - LLM observability unhealthy
❌ otel-collector - OpenTelemetry collection failing
❌ langflow - AI workflow builder unhealthy
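
A list like the one above can be regenerated from the Docker status column, which carries "(healthy)", "(unhealthy)" or "(health: starting)" only for containers that define a health check. The sketch below classifies tab-separated name/status pairs read from stdin; `flag_unhealthy` and its output labels are hypothetical.

```shell
# Sketch: classify containers from tab-separated name/status pairs on stdin.
# Containers reporting "(healthy)" produce no output.
flag_unhealthy() {
    awk -F '\t' '
        $2 ~ /\(unhealthy\)/        { print "UNHEALTHY\t" $1; next }
        $2 ~ /\(health: starting\)/ { print "STARTING\t" $1; next }
        $2 !~ /\(healthy\)/         { print "NO_HEALTHCHECK\t" $1 }
    '
}
```

Example: `docker ps --format '{{.Names}}\t{{.Status}}' | flag_unhealthy`. An empty result means every running container has a passing health check.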

Next Review: 2026-03-15
